PREFACE


The Sixth Copper Mountain Conference on Multigrid Methods was held on
April 4-9, 1993, at Copper Mountain, Colorado, and was cosponsored by NASA,
the Air Force Office of Scientific Research, the Department of Energy, and the 
National Science Foundation. The University of Colorado at Denver, Front Range 
Scientific Computations, Inc., and the Society for Industrial and Applied 
Mathematics provided organizational support for the conference. 

This document is a collection of many of the papers that were presented at 
the conference and thus represents the conference proceedings. NASA Langley 
graciously provided printing of this book so that all of the papers could be 
presented in a single forum. Each paper was reviewed by a member of the 
conference organizing committee under the coordination of the editors. 

The multigrid discipline continues to expand and mature, as is evident from 
these proceedings. The vibrancy in this field is amply expressed in these 
important papers, and the collection clearly shows its rapid trend to further 
diversity and depth. 


N. Duane Melson 

NASA Langley Research Center 

Steve F. McCormick and 
Tom A. Manteuffel 
University of Colorado at Denver 


The use of trademarks or manufacturer's names in this publication does not 
constitute endorsement, either expressed or implied, by the National Aeronautics and 
Space Administration. 
Abstract

We consider the problem of image reconstruction from a finite number of
projections over the space $L^1(\Omega)$, where $\Omega$ is a compact subset of
$\mathbb{R}^2$. We prove that, given a discretization of the projection space,
the function that generates the correct projection data and maximizes the
Boltzmann-Shannon entropy is piecewise constant on a certain discretization of
$\Omega$, which we call the “optimal grid”. It is on this grid that one obtains
the maximum resolution given the problem setup. The size of this grid grows
very quickly as the number of projections and number of cells per projection
grow, indicating that fast computational methods are essential to make its use
feasible.

We use a Fenchel duality formulation of the problem to keep the number of 
variables small while still using the optimal discretization, and propose a multilevel 
scheme to improve convergence of a simple cyclic maximization scheme applied to 
the dual problem. 
1 Introduction


In computerized tomography (CT), one encounters the problem of reconstructing an
image, or a density, defined by the function x(s,t), given only a finite number
of projection data. General references include [1] and [2].

The projection data is typically of the form

$$ b_k^m = \int_{\Omega_k^m} x(s,t)\,ds\,dt, \tag{1.1} $$

where the support of x is assumed to lie in the bounded region
$\Omega \subset \mathbb{R}^2$ and $\Omega_k^m$ is the k-th strip orthogonal to
the m-th projection (see Figure 1). We assume that there are M projections and
that the m-th projection has $K_m$ cells. Let b be the vector of projection
data, N the length of b, and $\psi_k^m$ the characteristic function of
$\Omega_k^m$. We then rewrite (1.1) as

$$ b = Ax, \tag{1.2} $$

where $A : L^1(\Omega) \to \mathbb{R}^N$ via

$$ (Ax)_{k,m} = \int_\Omega x(s,t)\,\psi_k^m(s,t)\,ds\,dt. \tag{1.3} $$

The reconstruction problem we study is: given the projection data b, find a density
function x such that $Ax = b$. Since A has an infinite-dimensional kernel, solutions, if they
exist, are not unique. The problem then becomes: find the “best” function $x_0$ such that
$Ax_0 = b$. The concept of “best” is ambiguous to be sure, but some criteria have been
gaining acceptance. In this paper we choose to study the solution with maximum entropy
as defined by Shannon [3] in information theory. For an informal discussion of entropy
and information theory, see for example [4]. For a discussion of maximum entropy in
image reconstruction, see [5].

In our context, we wish to find the function $x_0 \in L^1(\Omega)$ such that $x_0$ attains

$$ \sup\left\{ -\int_\Omega x(s,t)\ln[x(s,t)]\,ds\,dt : Ax = b \right\}. \tag{1.4} $$


This is the maximum entropy solution to $Ax = b$. Simply because we would rather
minimize a convex function than maximize a concave function, we rewrite this as a convex
minimization program via

$$ p = \inf\left\{ \int_\Omega x(s,t)\ln[x(s,t)]\,ds\,dt : Ax = b \right\}. \tag{1.5} $$

This is called the primal problem. If we further define the function
$\phi : \mathbb{R} \to (-\infty, +\infty]$ by

$$ \phi(u) = \begin{cases} u\ln u, & u > 0, \\ 0, & u = 0, \\ +\infty, & u < 0, \end{cases} \tag{1.6} $$

then we can rewrite (1.5) as

$$ p = \inf\left\{ \int_\Omega \phi(x(s,t))\,ds\,dt : Ax = b \right\}, \tag{1.7} $$

which is in a form that has been receiving much attention in the optimization community
of late. In particular, see Borwein and Lewis [6]. One of the features of this functional is
that it forces feasible functions to be nonnegative, as a density or image function should
be. We shall see that it has other properties that are computationally and theoretically
attractive, one of the most important being that solutions exist in $L^1(\Omega)$ under rather
mild conditions.

In this paper we shall characterize solutions to (1.5) given some reasonable conditions
on the data, b, and show that the solution in $L^1(\Omega)$ for the CT case is piecewise
constant, but usually not on a rectangular grid. While this result seems to be known in
the tomography community, it is rare that one finds a mathematically sound derivation
of the solution. The first half of this paper discusses the difficulties in addressing this
problem and references the literature to outline a correct proof of our characterization.
Note that we do not initially impose a discretization of $L^1(\Omega)$ or $\Omega$, only of the data. The
discretization we shall use arises as a consequence of the form of the functions $\psi_k^m$.

The second half of this paper discusses implementation details. It will turn out that
the appropriate grid for optimal resolution (the “optimal grid”) is very large compared
to the amount of data, N, one has. We shall also see that finding the optimal function
can be reduced to solving a problem in $\mathbb{R}^N$, but the intermediate calculations require use
of the (large) optimal grid. We have found that a simple cyclic coordinate maximization
scheme applied to the Fenchel dual of (1.5), while convergent, tends to stall after a few
iterations. We describe a multigrid approach to accelerate convergence.


2 Maximum Entropy Solutions 

2.1 The Entropy Functional 

Using the entropy functional 

$$ f : L^1(\Omega) \to (-\infty, +\infty] : x \mapsto \int_\Omega \phi(x(s,t))\,ds\,dt \tag{2.1} $$

to pick a “best” density function x has been popularized by Shannon in information 
theory, but also arises in the context of thermodynamics with Boltzmann. For this reason 
we call this the Boltzmann-Shannon entropy. 

Entropy is, in short, the expected amount of information present in a probability 
density x. In our context, one can think of an image (appropriately scaled) as a probability 
density function, and computing the feasible density with maximum entropy yields the 
density carrying the most information. The standard reference is Shannon and Weaver [3], 
but many basic probability books contain some discussion of information and entropy (see 
[4, 7] for example). References that deal specifically with entropies in image reconstruction 
are [5, 8]. 

We remark that the theory and methods developed here apply directly to other
objective functionals, in particular, minimum $L^2$-norm solutions. In fact, with the minimum
$L^2$-norm functional, our iterative method is essentially only changed by replacing * by +
and / by -.

2.2 Existence of Solutions 

In this section we prove that solutions to (1.5) exist. Usually, this point is ignored, but is 
nevertheless an important issue. 

Throughout, let X be a linear normed space with topology $\tau$. We begin with some
definitions.

Definition 2.1 We say a set $K \subset X$ is $\tau$-sequentially compact if every sequence from
K has a $\tau$-convergent subsequence.

Definition 2.2 Given a function $f : X \to (-\infty, +\infty]$, for $\alpha \in \mathbb{R}$, we define the lower
level sets $L_\alpha$ of f to be

$$ L_\alpha = \{x : f(x) \le \alpha\}. \tag{2.2} $$

The following is a general existence theorem for solutions to constrained optimization 
problems. 
Theorem 2.3 Let f be $\tau$-lower semicontinuous (lsc) possessing $\tau$-sequentially compact
lower level sets, and let C be a $\tau$-closed set. Then

$$ p = \inf\{f(x) : x \in C\} \tag{2.3} $$

is attained for some $x_0 \in C$.

Proof: Let $\{x_n\} \subset C$ be a sequence such that $f(x_n) \to p$. Letting $\alpha = p + 1$, eventually
all $x_n \in L_\alpha$, for $n > N$. By $\tau$-sequential compactness of $L_\alpha$ there is a subsequence $x_{n_k}$
that converges to a point $x_0 \in L_\alpha$. Since C is $\tau$-closed, $x_0 \in C$; f is $\tau$-lsc, so

$$ p = \lim_{k \to \infty} f(x_{n_k}) \ge f(x_0) \ge p \tag{2.4} $$

and, thus, $p = f(x_0)$. ■

We are interested in the case where the $\tau$-topology is the weak topology on $L^1(\Omega)$.
Let $C = \{x : Ax = b\}$ for $A : X \to \mathbb{R}^N$ linear and continuous. Then $C = A^{-1}(\{b\})$
is closed (since A is continuous) and convex (since A is linear) and, hence, weakly closed.
This is called Mazur’s theorem; see [9, Corollary 4 Chapter 2], for example.

We now direct our attention to the functional 

$$ f(x) = \int_\Omega x(s,t)\ln[x(s,t)]\,ds\,dt. \tag{2.5} $$

From [10, Theorem 2.2], we see that f is weakly lsc with weakly compact lower level
sets, provided $\Omega$ is of finite measure. To apply Theorem 2.3, we need weakly sequentially
compact lower level sets. The Eberlein-Smulian theorem [9] states that a subset of a 
Banach space is weakly compact if and only if it is weakly sequentially compact, and thus 
we see that the lower level sets of / are weakly sequentially compact, so we can apply 
Theorem 2.3 to obtain the following: 

Theorem 2.4 With f defined as in (2.5) and $C = \{x : Ax = b\}$ for $A : L^1(\Omega) \to \mathbb{R}^N$
linear and continuous, the infimum

$$ p = \inf\{f(x) : Ax = b\}, \tag{2.6} $$

if finite, is attained for some $x_0 \in L^1(\Omega)$ such that $Ax_0 = b$.

Note that if $X = L^\infty$, then the level sets $L_\alpha$ of f are not bounded in the norm
topology. Hence $L_\alpha$ is not compact for any of the norm, weak or weak-* topologies,
which is a consequence of the fact that a continuous function attains its maximum on a
compact set, of the theory of dual pairs [11], and of the principle of uniform boundedness,
respectively. Thus, although it is tempting to approach the image reconstruction problem
in $L^\infty$, the initial problem of existence is much more difficult. However, we will see that
$L^1$-optimal solutions are actually $L^\infty$-optimal solutions as well. We will also see why one
might want to pose the problem in $L^\infty$.
2.3 Uniqueness of Solutions

In this section we show that solutions to (1.5) are unique. 

Definition 2.5 A function $g : X \to (-\infty, +\infty]$ is convex if, for all $x, y \in X$ and all
$\lambda \in (0, 1)$, we have

$$ g(\lambda x + (1 - \lambda)y) \le \lambda g(x) + (1 - \lambda)g(y). \tag{2.7} $$

Also, g is strictly convex if whenever $x \ne y$ and $g(x), g(y) \in \mathbb{R}$, this inequality is strict.
A set $C \subset X$ is convex if whenever $x, y \in C$ and $\lambda \in (0, 1)$, then

$$ \lambda x + (1 - \lambda)y \in C. \tag{2.8} $$

Lemma 2.6 If $f(x) = \int \phi(x(u))\,du$, then f is strictly convex if and only if $\phi$ is strictly
convex.


Proof: Assume that f is strictly convex and that there exist $u \ne v$ and a $\lambda \in (0, 1)$
such that

$$ \phi(\lambda u + (1 - \lambda)v) \ge \lambda\phi(u) + (1 - \lambda)\phi(v). \tag{2.9} $$

Then let $x(t) \equiv u$ and $y(t) \equiv v$, so that

$$ f(\lambda x + (1 - \lambda)y) \ge \lambda f(x) + (1 - \lambda)f(y), \tag{2.10} $$

contradicting the assumption that f is strictly convex.

Conversely, let $E = \{t : x(t) \ne y(t)\}$ and assume that $m(E) > 0$. Then

$$ f(\lambda x + (1 - \lambda)y) = \int_E \phi(\lambda x(t) + (1 - \lambda)y(t))\,dt + \int_{E^c} \phi(x(t))\,dt \tag{2.11} $$

$$ < \int_E \left[\lambda\phi(x(t)) + (1 - \lambda)\phi(y(t))\right] dt + \int_{E^c} \phi(x(t))\,dt \tag{2.12} $$

$$ = \lambda f(x) + (1 - \lambda)f(y), \tag{2.13} $$

where the strict inequality is due to the strict convexity of $\phi$. ■

It is easy to check that $\phi(u)$ as defined in (1.6) is strictly convex; thus f from (2.5) is
as well.

Theorem 2.7 If f is strictly convex and C is a convex set, then solutions to

$$ p = \inf\{f(x) : x \in C\} \tag{2.14} $$

are unique, provided they exist.
Proof: Suppose $p = f(x) = f(y)$, where $x \ne y \in C$. Since C is convex, then
$\tfrac{1}{2}x + \tfrac{1}{2}y \in C$ and

$$ f(x/2 + y/2) < \tfrac{1}{2}f(x) + \tfrac{1}{2}f(y) = p, \tag{2.15} $$

contradicting the definition of p. ■

Putting together these results we have the following. 

Theorem 2.8 If $\Omega$ is a set of finite measure, then the solution to (1.5) exists and is
unique.


2.4 Characterizing the Solution 


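In this section we characterize solutions to

$$ p = \inf\left\{ \int_\Omega \phi(x(s,t))\,ds\,dt : \int_\Omega x\,\psi_k^m = b_k^m,\ m = 1, \dots, M,\ k = 1, \dots, K_m \right\}, \tag{2.16} $$

where

$$ \phi(u) = \begin{cases} u\ln u - u, & u > 0, \\ 0, & u = 0, \\ +\infty, & u < 0. \end{cases} \tag{2.17} $$
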


This is the image reconstruction problem we introduced in Section 1. We have included
a linear factor in the objective functional; however, if we assume that the projections
cover the image, then this factor does not change the solution; it only simplifies certain
formulae. A typical approach is to attach a Lagrange multiplier $\lambda$ to the constraints and
differentiate the Lagrangian at the optimal $x_0$ (which we now know exists) to obtain

$$ x_0(s,t) = (\phi')^{-1}(A^T\lambda(s,t)) = \exp(A^T\lambda(s,t)), \tag{2.18} $$

where

$$ A^T : \mathbb{R}^N \to L^\infty(\Omega) \quad\text{via}\quad (A^T\lambda)(s,t) = \sum_{m=1}^{M}\sum_{k=1}^{K_m} \lambda_k^m\,\psi_k^m(s,t). \tag{2.19} $$

However, the functional $f(x) = \int_\Omega \phi(x(s,t))\,ds\,dt$ is not differentiable. Indeed, $f = +\infty$ on
a dense subset of $L^1(\Omega)$ and is therefore not even continuous. It is for this reason that
some people have chosen to work in $C(\Omega)$ or $L^\infty(\Omega)$ [12, 13] where f is differentiable, but
the question of existence and attainment is much more difficult there.

A correct approach is to use Fenchel duality with a constraint qualification (CQ).
While the classical CQ fails to apply in our example, in [6] a CQ is developed that does
apply to CT. Heuristically, we assume a multiplier $\lambda$ exists so that

$$ p = \inf_x \{f(x) + \langle b - Ax, \lambda\rangle\} \tag{2.20} $$

$$ = \langle b, \lambda\rangle + \inf_x \{f(x) - \langle x, A^T\lambda\rangle\} \tag{2.21} $$

$$ = \langle b, \lambda\rangle - \sup_x \{\langle x, A^T\lambda\rangle - f(x)\}. \tag{2.22} $$

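For a convex function $g : X \to (-\infty, +\infty]$ we define $g^* : X^* \to (-\infty, +\infty]$ by

$$ g^*(y) = \sup_{x \in X} \{\langle x, y\rangle - g(x)\}. \tag{2.23} $$

This is called the Fenchel conjugate of g at y. Using this definition, we see from (2.22) that

$$ p = \langle b, \lambda\rangle - f^*(A^T\lambda). \tag{2.24} $$

Now, for any $\lambda$,

$$ \inf_x \{f(x) + \langle b - Ax, \lambda\rangle\} = \langle b, \lambda\rangle - f^*(A^T\lambda) \le \inf\{f(x) : Ax = b\} = p. \tag{2.25} $$

Therefore, if $\lambda$ exists, then

$$ p = \max_\lambda \{\langle b, \lambda\rangle - f^*(A^T\lambda)\}. \tag{2.26} $$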

This is the Fenchel dual of (1.5). In our problem, it can be shown [14] that for $y \in L^\infty(\Omega)$

$$ f^*(y) = \int_\Omega \phi^*(y(s,t))\,ds\,dt, \tag{2.27} $$

that is, the conjugate of the integral functional f is given as an integral function of
$\phi^*$. From [6], if there exists $x \in L^1(\Omega)$ where $x > 0$ a.e., $f(x) \in \mathbb{R}$ and $Ax = b$ (the constraint
qualification), then a Lagrange multiplier $\lambda$ does exist. Also, if $\lambda$ solves (2.26), then the
optimal $x_0(s,t)$ for (1.5) is

$$ x_0(s,t) = (\phi^*)'(A^T\lambda(s,t)). \tag{2.28} $$

In the case $\phi$ is of the form (2.17), it is easy to show that $\phi^*(v) = e^v$, and we get

$$ x_0(s,t) = \exp(A^T\lambda(s,t)), \tag{2.29} $$

where $\lambda$ solves (2.26). This matches the heuristic derivation, but this is no accident since
in fair generality, $(\phi')^{-1} = (\phi^*)'$. Thus, solving the image reconstruction problem, (1.5), is
equivalent to solving the dual problem (2.26), which is an unconstrained finite-dimensional
differentiable concave maximization problem.
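
The claim that the conjugate of the kernel in (2.17) is the exponential can be checked numerically. The sketch below approximates the supremum in (2.23) for a scalar argument by a brute-force grid search and compares it with $e^v$. The function names, the search interval [0, 20], and the grid size are illustrative assumptions of ours and are not part of the development above.

```python
import math

def phi(u):
    """Entropy kernel of (2.17): u ln u - u for u > 0, 0 at u = 0, +inf for u < 0."""
    if u > 0:
        return u * math.log(u) - u
    return 0.0 if u == 0 else math.inf

def conjugate_numeric(v, u_max=20.0, n=20001):
    """Approximate phi*(v) = sup_u {u*v - phi(u)} by a grid search over [0, u_max].

    The grid bounds are an assumption: the maximizer u = exp(v) must lie
    inside [0, u_max] for the approximation to be meaningful.
    """
    best = -math.inf
    for i in range(n):
        u = u_max * i / (n - 1)
        best = max(best, u * v - phi(u))
    return best

# Compare the numerical conjugate with exp(v) for a few sample values.
for v in (-1.0, 0.0, 0.5, 2.0):
    print(v, conjugate_numeric(v), math.exp(v))
```

The two printed columns should agree to several digits as long as $e^v$ stays well inside the search interval.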


Figure 2: The optimal grid.


2.5 The Optimal Grid 

Rewriting the solution to the image reconstruction problem (2.29) in a more explicit form,
we have

$$ x_0(s,t) = \exp\left( \sum_{m=1}^{M}\sum_{k=1}^{K_m} \lambda_k^m\,\psi_k^m(s,t) \right); \tag{2.30} $$

see (2.19). It can be seen from the concavity of (2.26) that any function x of the form
(2.30) that satisfies $Ax = b$ must be optimal.

Recall that $\psi_k^m$ is the characteristic function of the k-th strip emanating from the
m-th projection. Thus, the solution in (2.30) implies that the optimal function in $L^1(\Omega)$ is
piecewise constant on the grid obtained by intersecting all of the strips. This grid has been
observed for physical reasons [8], but here we have shown that the best (from maximum
entropy considerations) function from $L^1(\Omega)$ is this piecewise constant function. For this
reason, we call this discretization of $\Omega$ the optimal grid.

A typical grid is shown in Figure 2. Here we have used 8 projections each divided into 
12 cells. The central point of the theory presented above is that the exact solution of the 
image reconstruction problem is piecewise constant on this grid. 
3 Implementation


3.1 Relaxation 


Now we develop a simple iterative procedure to solve the image problem by solving the
dual problem (2.26) for the optimal $\lambda$. Recall that to solve the dual problem we seek to
maximize

$$ G(\lambda) = \langle b, \lambda\rangle - \int_\Omega \exp(A^T\lambda(s,t))\,ds\,dt. \tag{3.1} $$

This is a simple matter once one observes that G is concave and differentiable, so
maximizing G amounts to finding critical points. The derivative with respect to $\lambda$ is

$$ \nabla G(\lambda) = b - A\exp(A^T\lambda). \tag{3.2} $$

This gives N equations for the N unknown $\lambda$’s.

An obvious iterative scheme is to cycle through the components of $\nabla G$, correcting
each $\lambda_k$ in turn so that the k-th component of $\nabla G$ is zero, that is, so that G is maximal
with respect to each individual variable. We choose $\lambda_k^m$ so that

$$ b_k^m = (A\exp(A^T\lambda))_{k,m} = \int_{\Omega_k^m} \exp(A^T\lambda(s,t))\,ds\,dt. \tag{3.3} $$

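After some simplifications, a single step of this scheme can be written as

$$ \exp(\lambda_{k'}^{m'}) = \frac{b_{k'}^{m'}}{\int_{\Omega_{k'}^{m'}} \exp\left(\sum{}' \lambda_k^m\,\psi_k^m(s,t)\right) ds\,dt} = \frac{b_{k'}^{m'}}{\int_{\Omega_{k'}^{m'}} \prod{}' (\mu_k^m)^{\psi_k^m(s,t)}\,ds\,dt}, \tag{3.4} $$

where $\sum'$ is the sum over all projections m and cells k except m' and k', $\prod'$ is similarly
defined and $\mu_k^m = \exp(\lambda_k^m)$.

Now, because $A^T\lambda$ is piecewise constant on the optimal grid, the integral of $\exp(A^T\lambda)$
along any strip is just a sum over each polygon in that strip of the area of the polygon
times the product of the $\mu$’s that correspond to the particular cell from each projection
that makes up the polygon:

$$ \int_{\Omega_{k'}^{m'}} \prod{}' (\mu_k^m)^{\psi_k^m(s,t)}\,ds\,dt = \sum_{p \subset \Omega_{k'}^{m'}} \operatorname{area}(p) \prod_{m \ne m'} \mu_{k_m}^m, \tag{3.5} $$

where the product is only over the cells $k_m$ in projection m for polygon p.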

The point of this discussion is twofold. First, we can see from (3.4) that we never
need to exponentiate since we only need the $\mu$’s. Second, the areas of the polygons can
be precomputed, meaning that these integrals can be calculated exactly; no numerical
integration is needed.
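
The cyclic update of (3.4) and (3.5) can be sketched in a few lines. The example below is a minimal illustration under assumptions of our own: the polygon list is a toy geometry (two orthogonal projections on the unit square, so the polygons are simply the cells of an n-by-n grid of equal area), and the names `strip_integral`, `relax_sweep`, and `max_residual` are hypothetical. It is not the implementation used to produce the results reported here.

```python
from itertools import product

def strip_integral(polys, mu, m0, k0, include_own=True):
    """Sum over polygons in strip (m0, k0) of area * product of mu's.

    polys is a list of (area, cells), where cells[m] is the cell index of the
    polygon in projection m. With include_own=False the factor mu[m0][k0] is
    left out, which gives the denominator of the update (3.4).
    """
    total = 0.0
    for area, cells in polys:
        if cells[m0] != k0:
            continue
        term = area
        for m, k in enumerate(cells):
            if m != m0 or include_own:
                term *= mu[m][k]
        total += term
    return total

def relax_sweep(polys, mu, b):
    """One cyclic sweep: set each mu so that its own residual component is zero."""
    for m0 in range(len(mu)):
        for k0 in range(len(mu[m0])):
            mu[m0][k0] = b[m0][k0] / strip_integral(polys, mu, m0, k0, include_own=False)

def max_residual(polys, mu, b):
    """Largest absolute component of b - A exp(A^T lambda)."""
    return max(abs(b[m][k] - strip_integral(polys, mu, m, k))
               for m in range(len(mu)) for k in range(len(mu[m])))

# Toy geometry (an assumption): two orthogonal projections, n cells each.
n = 3
polys = [(1.0 / n**2, (i, j)) for i, j in product(range(n), repeat=2)]
mu_true = [[0.5, 1.0, 2.0], [1.5, 0.7, 1.2]]
b = [[strip_integral(polys, mu_true, m, k) for k in range(n)] for m in range(2)]

mu = [[1.0] * n for _ in range(2)]
for sweep in range(20):
    relax_sweep(polys, mu, b)
print(max_residual(polys, mu, b))
```

Only products and quotients of the $\mu$’s and precomputed areas appear, so no exponentials or numerical quadrature are evaluated inside the sweep. The individual $\mu$’s are determined only up to a scaling between projections in this toy case, so the residual, not the $\mu$’s themselves, is the quantity to monitor.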
Figure 3: A subset of the optimal grid.

An important feature of this development is that there are no approximations or
discretizations being made in the whole program, except for the initial discretization of
the data into the vector b. We have characterized the $L^1(\Omega)$ solution as a piecewise
constant function on the optimal grid, and the necessary integrals can be performed
exactly on this grid. For this reason, in our implementation we calculate the areas of each
of the polygons in the optimal grid; indeed, to obtain Figure 2, we plotted the
polygons themselves, not just intersecting lines. To demonstrate this fact, in Figure 3 we
plotted 1000 of the 2096 polygons from Figure 2.

As can be observed in Figure 2, the number of polygons in the optimal grid can be very 
large compared to the number of projections and cells per projection. This number grows 
very quickly as a function of these two variables; for example, with 12 projections placed 
uniformly around the circle and 10 cells per projection, we generate 2904 polygons; with 
20 cells per projection we generate 12,561 polygons. While the optimal grid generation 
is a time and storage intensive procedure, given a fixed geometry we need only run this 
part of the program once. To reconstruct images using such large data sets, fast methods 
are essential. 

The iterative method described above tends to stall after a few iterations. This effect
can be observed in the reported data from [8]. In Figure 4 the top two curves represent the
rates of convergence using this scheme on a 3 projection, 4 cell per projection problem.
Here we have graphed both the rate associated with the residual $\|Ax - b\|_\infty$ and the
rate of convergence of the entropies computed as $G(\lambda_{\text{new}})$. Since we give as initial data a
known function of given $\lambda$’s, we can compute the true entropy to which the iterates should
be converging; this is a useful debugging tool since we can monitor both the residual and
the entropy. Note that the convergence is initially good, but the rate degrades rapidly
and levels off at about 0.85.

Figure 4: Residual and entropy rates, with and without MG.

3.2 Multilevel Methods 

To improve the convergence rates observed, we have implemented a two-grid multilevel
scheme with unigrid [15] corrections. In this section we describe our method and give
some preliminary results.

Coarsening is achieved by pairing adjacent cells in the projections and updating the
associated $\mu$’s with a single correction that makes their average residual zero. On the
coarse grid, we iterate this relaxation process until the norm of the vector of average
residuals is below a user supplied $\epsilon$, typically 0.05. All of the calculations are done in a
unigrid fashion on the fine grid. This makes the process more expensive than necessary,
but its performance is equivalent to the more efficient V-cycle multigrid scheme and it is
much easier to implement and manipulate.

Using the same geometry and data as before, but with relaxation accelerated by coarse 
grid corrections, we obtain the rates given in the lower two curves in Figure 4. Note that 
with only a two-grid scheme, we have reduced the convergence rate from about 0.85 to 
about 0.72. 
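
One possible reading of the coarse-grid correction described above is sketched below: for each pair of adjacent cells in a projection, both $\mu$’s are multiplied by a single common factor chosen so that the summed (and hence the average) residual of the pair is zero. Since the strips of one projection are disjoint, each strip integral scales linearly with its own $\mu$, which makes the factor explicit. The function names and the toy geometry (two orthogonal projections on a uniform 4-by-4 grid) are assumptions of ours; the sketch performs one pass over one projection rather than iterating to the tolerance $\epsilon$.

```python
from itertools import product

def strip_integral(polys, mu, m0, k0):
    """Integral of exp(A^T lambda) over strip (m0, k0): sum of area * product of mu's."""
    total = 0.0
    for area, cells in polys:
        if cells[m0] == k0:
            term = area
            for m, k in enumerate(cells):
                term *= mu[m][k]
            total += term
    return total

def coarse_correction(polys, mu, b, m0):
    """Scale each adjacent pair of mu's in projection m0 by one common factor
    so that the summed residual of the pair vanishes. With an odd number of
    cells the last cell is left unchanged."""
    for k in range(0, len(mu[m0]) - 1, 2):
        s = strip_integral(polys, mu, m0, k) + strip_integral(polys, mu, m0, k + 1)
        c = (b[m0][k] + b[m0][k + 1]) / s
        mu[m0][k] *= c
        mu[m0][k + 1] *= c

# Toy geometry and data (assumptions): 2 projections, 4 cells each.
n = 4
polys = [(1.0 / n**2, (i, j)) for i, j in product(range(n), repeat=2)]
mu = [[1.0, 2.0, 0.5, 1.5], [0.8, 1.2, 1.0, 0.6]]
b = [[0.3, 0.2, 0.4, 0.1], [0.25, 0.25, 0.3, 0.2]]

coarse_correction(polys, mu, b, 0)
pair_residuals = [
    b[0][k] + b[0][k + 1]
    - strip_integral(polys, mu, 0, k) - strip_integral(polys, mu, 0, k + 1)
    for k in range(0, n, 2)
]
print(pair_residuals)
```

In terms of the multipliers, multiplying both $\mu$’s of a pair by c corresponds to adding ln c to both $\lambda$’s, so the correction is a single coarse-level unknown per pair applied directly on the fine grid.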

As a demonstration of the reconstructions we can obtain, we present Figures 5 and 6.
Recall that the reconstruction is computed on the optimal grid, but for plotting purposes
we essentially use a square grid. While there are several ways of translating from the
optimal grid to a square grid, we have chosen simply to evaluate the optimal image
function as reconstructed from the optimal $\lambda$ via (2.30) at the lattice points of the square
grid without performing interpolation. Improving this process could be a direction of
future investigation.

Figure 5: Original image.

Figure 5 is the original image. Note that this is not a piecewise constant function 
on the optimal grid, so our reconstructions cannot be exact. In fact, the function of 
maximum entropy with the data generated by this function is not the original image; as 
shown previously, it is piecewise constant on the optimal grid, as are our reconstructions. 

To obtain Figure 6, we use a simple two-dimensional integrator, based on Simpson’s
rule, on the original function to make b using 5 projections each with 8 cells. We then
iterated our multilevel scheme to convergence (so that the $\ell^2$ norm of the average residuals
was less than $10^{-4}$) and plotted the result in Figure 6 as discussed above. This process
produces a set of $\lambda$’s that we then used as our initial data in the routine to obtain a
reconstruction of the first reconstruction. This next reconstruction is virtually identical
to the first, as the theory predicts.

As a final note, an extra benefit of this scheme is data compression. Given a data
collection geometry, when we collect the N pieces of data, we need only solve the
reconstruction problem once to get the N $\mu$’s. From these we can reconstruct the image to any
level of resolution desired; indeed, we have shown that the piecewise constant function on
the optimal grid is the most information one can extract from the data.



Figure 6: Reconstruction. 


4 Conclusions 

We have seen that the solution to the maximum entropy image reconstruction problem
posed in $L^1(\Omega)$ is a piecewise constant function on the optimal grid. We have also seen
that each iterate of the simple iterative scheme for solving the associated dual problem can
be computed exactly; no numerical integration or approximations are needed. Finally, we
observed that a unigrid scheme to accelerate convergence shows potential, though more
testing and analysis are needed.

As a final comment, we note that the mathematics used to derive the optimal grid
and the iterative scheme can be applied to other objective functionals f and other
geometries, for example, minimum $L^2$-norm, fan beam projections and non-symmetric placement
of the projections. These issues and a more complete discussion of the mathematics in
optimization techniques for image reconstruction from projections will be covered in a
future paper.
SUMMARY


We develop a new numerical approach to study the spatially evolving instability of the 
streamwise dominant flow in the presence of roughness elements. The difficulty in handling the flow 
over the boundary surface with general geometry is removed by using a new conservative form of 
the governing equations and an analytical mapping. The numerical scheme uses second-order 
backward Euler in time, fourth-order central differences in all three spatial directions, and 
boundary-fitted staggered grids. A three-dimensional channel with multiple two-dimensional-type 
roughness elements is employed as the test case. Fourier analysis is used to decompose different 
Fourier modes of the disturbance. The results show that surface roughness leads to transition at 
lower Reynolds number than for smooth channels. 
INTRODUCTION


Transition from laminar to turbulent flow is a phenomenon of great importance and practical 
interest. Major experimental work in aerodynamics has been pursued to study boundary layer 
transition, and this in turn has led to a critical need for further understanding of this fundamental 
process. Unfortunately, to date, no reliable methods exist for predicting transition in the presence of 
surface roughness, either experimentally or numerically. In fact, there are many factors that affect 
transition, such as solid wall temperature, solid wall curvature, pressure gradients, free-stream 
disturbance, and surface roughness. There has been some experimental activity dealing with the 
effect of both 2-D and 3-D roughness elements on transition from laminar to turbulent flow. For 
example, Klebanoff et al. showed that surface roughness induces early transition [1]. Also, others
have obtained limited numerical results with 2-D flow [2].

The purpose of the present work is to develop new efficient and easy-to-use methods for 
numerical simulation of the effect of surface roughness on flow transition. A new conservative form 
of the governing equations is derived. Because of the high sensitivity of transitional flows, a high 
order scheme based on our earlier work (cf. [3] - [6]) is developed. For the grid generation scheme, 
an analytical map is used, so the Jacobian coefficients are computed exactly. Moreover, we develop 
the governing equation in a form that enables a much simpler numerical process. 

We impose single and multiple 2-D-type roughness elements on the lower solid wall to test the 
effect of surface roughness on flow transition. A Fourier transformation is employed to analyze 
different modes of the resulting disturbance. The computational results show that the induced 
mean flow distortion and other high frequency waves make the flow more unstable. 

* This work was supported by NASA under grant number NAS1-19312. 
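GOVERNING EQUATIONS


In this study, the three-dimensional, time-dependent, incompressible Navier-Stokes equations,
which are nondimensionalized by the channel half-height h and the centerline velocity, are
considered as the governing equations for 3-D channel flow (Figure 1):

$$ \frac{\partial u}{\partial t} + \frac{\partial uu}{\partial x} + \frac{\partial uv}{\partial y} + \frac{\partial uw}{\partial z} - \frac{1}{Re}\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2}\right) + \frac{\partial P}{\partial x} = 0, \tag{1} $$

$$ \frac{\partial v}{\partial t} + \frac{\partial vu}{\partial x} + \frac{\partial vv}{\partial y} + \frac{\partial vw}{\partial z} - \frac{1}{Re}\left(\frac{\partial^2 v}{\partial x^2} + \frac{\partial^2 v}{\partial y^2} + \frac{\partial^2 v}{\partial z^2}\right) + \frac{\partial P}{\partial y} = 0, \tag{2} $$

$$ \frac{\partial w}{\partial t} + \frac{\partial wu}{\partial x} + \frac{\partial wv}{\partial y} + \frac{\partial ww}{\partial z} - \frac{1}{Re}\left(\frac{\partial^2 w}{\partial x^2} + \frac{\partial^2 w}{\partial y^2} + \frac{\partial^2 w}{\partial z^2}\right) + \frac{\partial P}{\partial z} = 0, \tag{3} $$

$$ \frac{\partial u}{\partial x} + \frac{\partial v}{\partial y} + \frac{\partial w}{\partial z} = 0, \tag{4} $$

where u, v, and w are velocity components in the x-, y-, and z-directions, respectively, P is the
pressure, and Re is the Reynolds number based on the centerline velocity of mean flow, the
channel half-height h, and the viscosity parameter $\nu$:

$$ Re = \frac{U_{\text{centerline}}\,h}{\nu}. \tag{5} $$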



Figure 1. 3-D channel with a single 2-D-type roughness element. 


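For the current work, we consider a special mapping

$$ x = \xi, \quad y = y(\xi, \eta, \zeta), \quad z = \zeta \qquad\text{or}\qquad \xi = x, \quad \eta = \eta(x, y, z), \quad \zeta = z, $$

which implies that

$$ J = \eta_y, \qquad \xi_x = \zeta_z = 1, \qquad \xi_z = \zeta_x = 0, $$
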

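and the final forms of the momentum and continuity equations are:

$$ \frac{\partial u}{\partial t} + \eta_y\left(\frac{\partial uU}{\partial \xi} + \frac{\partial uV}{\partial \eta} + \frac{\partial uW}{\partial \zeta}\right) + \left(\frac{\partial}{\partial \xi} + \eta_x\frac{\partial}{\partial \eta}\right)P - \frac{1}{Re}\Delta_1 u = 0, \tag{6} $$

$$ \frac{\partial v}{\partial t} + \eta_y\left(\frac{\partial vU}{\partial \xi} + \frac{\partial vV}{\partial \eta} + \frac{\partial vW}{\partial \zeta}\right) + \eta_y\frac{\partial P}{\partial \eta} - \frac{1}{Re}\Delta_1 v = 0, \tag{7} $$

$$ \frac{\partial w}{\partial t} + \eta_y\left(\frac{\partial wU}{\partial \xi} + \frac{\partial wV}{\partial \eta} + \frac{\partial wW}{\partial \zeta}\right) + \left(\eta_z\frac{\partial}{\partial \eta} + \frac{\partial}{\partial \zeta}\right)P - \frac{1}{Re}\Delta_1 w = 0, \tag{8} $$

$$ \frac{\partial U}{\partial \xi} + \frac{\partial V}{\partial \eta} + \frac{\partial W}{\partial \zeta} = 0, \tag{9} $$

for the total flow, or

$$ \frac{\partial u}{\partial t} + \eta_y\left(\frac{\partial[u(U + U_0) + u_0 U]}{\partial \xi} + \frac{\partial[u(V + V_0) + u_0 V]}{\partial \eta} + \frac{\partial[u(W + W_0) + u_0 W]}{\partial \zeta}\right) + \left(\frac{\partial}{\partial \xi} + \eta_x\frac{\partial}{\partial \eta}\right)P - \frac{1}{Re}\Delta_1 u = 0, \tag{10} $$

$$ \frac{\partial v}{\partial t} + \eta_y\left(\frac{\partial[v(U + U_0) + v_0 U]}{\partial \xi} + \frac{\partial[v(V + V_0) + v_0 V]}{\partial \eta} + \frac{\partial[v(W + W_0) + v_0 W]}{\partial \zeta}\right) + \eta_y\frac{\partial P}{\partial \eta} - \frac{1}{Re}\Delta_1 v = 0, \tag{11} $$

$$ \frac{\partial w}{\partial t} + \eta_y\left(\frac{\partial[w(U + U_0) + w_0 U]}{\partial \xi} + \frac{\partial[w(V + V_0) + w_0 V]}{\partial \eta} + \frac{\partial[w(W + W_0) + w_0 W]}{\partial \zeta}\right) + \left(\eta_z\frac{\partial}{\partial \eta} + \frac{\partial}{\partial \zeta}\right)P - \frac{1}{Re}\Delta_1 w = 0, \tag{12} $$

$$ \frac{\partial U}{\partial \xi} + \frac{\partial V}{\partial \eta} + \frac{\partial W}{\partial \zeta} = 0, \tag{13} $$

for the perturbation flow, where

$$ \Delta_1 = \frac{\partial^2}{\partial \xi^2} + (\eta_x^2 + \eta_y^2 + \eta_z^2)\frac{\partial^2}{\partial \eta^2} + \frac{\partial^2}{\partial \zeta^2} + 2\eta_x\frac{\partial^2}{\partial \xi\,\partial \eta} + 2\eta_z\frac{\partial^2}{\partial \eta\,\partial \zeta} + (\eta_{xx} + \eta_{yy} + \eta_{zz})\frac{\partial}{\partial \eta}. \tag{14} $$

The inverse transformation of the variables under this special mapping becomes

$$ u = U\eta_y, \tag{15} $$

$$ w = W\eta_y, \tag{16} $$

$$ v = V - U\eta_x - W\eta_z. \tag{17} $$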

Here we have seven unknowns (u, v, w, P, U, V, W), seven equations ((6) - (9), (15) - (17)) for
the base flow or total flow, and seven equations ((10) - (13), (15) - (17)) for the perturbation.

Our solution process is outlined as follows: 

1. Perform the surface and grid generation process to obtain the required Jacobian coefficients. 
2. Solve system (6) - (9) and (15) - (17) to obtain the base flow solution.

3. Solve system (10)-(13) and (15) - (17) to obtain the perturbation solution based on the above 
base flow. 


For the channel flow, though the boundary conditions are quite simple, there are still some 
difficulties we need to overcome. The so-called buffer domain [7] technique is used here for both the 
base flow and perturbation. For details, see [6]. The boundary condition for the solid wall is the 
no-slip boundary condition: 

$$ u|_{\text{wall}} = v|_{\text{wall}} = w|_{\text{wall}} = 0. \tag{18} $$


No boundary condition is needed for the pressure at the solid wall since we use a staggered grid. 
For the inflow, Poiseuille flow is imposed at the inlet for the base flow solution, and the 
eigenfunctions obtained from the linear stability theory with specified Reynolds number are 
employed at the flow inlet for the perturbation. The final outflow boundary conditions are: 


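• for the base flow,

$$ \begin{cases} \dfrac{\partial U}{\partial \xi} + \dfrac{\partial V}{\partial \eta} + \dfrac{\partial W}{\partial \zeta} = 0 & \text{for } U, \\[4pt] \dfrac{\partial^2 V}{\partial \xi^2} = 0 & \text{for } V, \\[4pt] \dfrac{\partial^2 W}{\partial \xi^2} = 0 & \text{for } W, \end{cases} \tag{19} $$

• for the perturbation flow,

$$ \begin{cases} \dfrac{\partial^2 U}{\partial \xi^2} = 0 & \text{for } U, \\[4pt] \dfrac{\partial^2 V}{\partial \xi^2} = 0 & \text{for } V, \\[4pt] \dfrac{\partial^2 W}{\partial \xi^2} = 0 & \text{for } W, \end{cases} \tag{20} $$

and the associated u, v, w can similarly be obtained by the inverse transformation (15) - (17).


Periodicity is assumed in the spanwise $\zeta$-direction.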


NUMERICAL PROCEDURE 
Surface and Grid Generation 


Assume that no stagnation points exist in the computational domain. Solitary type roughness 
elements are overlapped on the lower solid wall of the channel to simulate surface roughness. For 
the 2-D roughness elements (because the grid is uniform and the domain is periodic in the spanwise 
z direction, we need only discuss the 2-D case here), the surface can be expressed as 

$$f(x) = \sum_{i=1}^{m} \kappa_i \,\mathrm{sech}^2\bigl(b_i (x - x_i)\bigr), \qquad (21)$$

where $\kappa_i$ is the height of the roughness element, $b_i$ is a parameter for adjusting the curvature rate of
the roughness element, and $x_i$ is the peak point coordinate.
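As a small illustration of equation (21), the following sketch evaluates the roughness profile and its slope $f_x$, which is needed later for the Jacobian coefficients. The function names and the sample parameter values are hypothetical and not taken from the authors' code.

```python
import math

def sech(t):
    """Hyperbolic secant."""
    return 1.0 / math.cosh(t)

def roughness_height(x, kappas, bs, peaks):
    """f(x) = sum_i kappa_i * sech^2(b_i * (x - x_i)), as in equation (21)."""
    return sum(k * sech(b * (x - xp)) ** 2
               for k, b, xp in zip(kappas, bs, peaks))

def roughness_slope(x, kappas, bs, peaks):
    """f_x(x), using d/dt sech^2(t) = -2 sech^2(t) tanh(t)."""
    return sum(-2.0 * k * b * sech(b * (x - xp)) ** 2 * math.tanh(b * (x - xp))
               for k, b, xp in zip(kappas, bs, peaks))

# Illustrative values only: two elements of height 0.15.
kappas, bs, peaks = [0.15, 0.15], [2.0, 2.0], [3.0, 9.0]
height_at_peak = roughness_height(3.0, kappas, bs, peaks)
```

Each element contributes its full height $\kappa_i$ at its own peak and decays exponentially away from it, so well-separated elements barely interact.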



Figure 2. Physical, stretched, and uniform grids. 


Our grid generation approach consists of two mappings (see Figure 2):

• from the physical grid that conforms to the rough boundary to the stretched intermediate grid
that is uniform in the x direction but nonuniform in y,

• from the stretched grid to the uniform computational grid.


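The resulting mapping from the physical to the uniform computational grid is

$$\eta = \frac{y_{\max}(\sigma + y_{\max})\,(y - f(x))}{y_{\max}(\sigma + y) - f(x)(\sigma + y_{\max})}, \qquad (22)$$

and its inverse map is

$$y = \frac{\eta\sigma\,(y_{\max} - f(x)) + y_{\max} f(x)\,(\sigma + y_{\max} - \eta)}{y_{\max}(\sigma + y_{\max} - \eta)}, \qquad (23)$$

where $y_{\max}$ is the maximum height of the computational domain and $\sigma$ is the parameter for
adjusting the density of grids near the lower solid wall. The required Jacobian coefficients are

$$\eta_x = \frac{y_{\max} f_x\, \sigma(\sigma + y_{\max})\,(y - y_{\max})}{\bigl[y_{\max}(\sigma + y) - f(x)(\sigma + y_{\max})\bigr]^2}, \qquad (24)$$

$$\eta_y = \frac{y_{\max}\, \sigma(\sigma + y_{\max})\,(y_{\max} - f(x))}{\bigl[y_{\max}(\sigma + y) - f(x)(\sigma + y_{\max})\bigr]^2}, \qquad (25)$$

$$\eta_{xx} = \frac{y_{\max}\, \sigma(\sigma + y_{\max})\,(y - y_{\max})\,\bigl\{ f_{xx}\bigl[y_{\max}(\sigma + y) - f(x)(\sigma + y_{\max})\bigr] + 2 f_x^2(\sigma + y_{\max})\bigr\}}{\bigl[y_{\max}(\sigma + y) - f(x)(\sigma + y_{\max})\bigr]^3}, \qquad (26)$$

$$\eta_{yy} = \frac{-2\,y_{\max}^2\, \sigma(\sigma + y_{\max})\,(y_{\max} - f(x))}{\bigl[y_{\max}(\sigma + y) - f(x)(\sigma + y_{\max})\bigr]^3}, \qquad (27)$$

$$\eta_z = \eta_{zz} = 0. \qquad (28)$$

Here, $f_x = \dfrac{df}{dx}$ and $f_{xx} = \dfrac{d^2 f}{dx^2}$.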
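The mapping (22), its inverse (23) and the first-order Jacobian coefficients (24)-(25) can be checked numerically. The sketch below is illustrative only; the function names and the sample values of $y_{\max}$, $\sigma$ and $f$ are assumptions chosen for the check.

```python
def _denominator(y, f, y_max, sigma):
    """Common denominator y_max*(sigma + y) - f*(sigma + y_max)."""
    return y_max * (sigma + y) - f * (sigma + y_max)

def eta_of_y(y, f, y_max, sigma):
    """Physical height y -> computational coordinate eta, equation (22)."""
    return y_max * (sigma + y_max) * (y - f) / _denominator(y, f, y_max, sigma)

def y_of_eta(eta, f, y_max, sigma):
    """Inverse map, equation (23)."""
    num = eta * sigma * (y_max - f) + y_max * f * (sigma + y_max - eta)
    return num / (y_max * (sigma + y_max - eta))

def eta_x(y, f, f_x, y_max, sigma):
    """Jacobian coefficient (24); f_x is the local wall slope."""
    d = _denominator(y, f, y_max, sigma)
    return y_max * f_x * sigma * (sigma + y_max) * (y - y_max) / d ** 2

def eta_y(y, f, y_max, sigma):
    """Jacobian coefficient (25)."""
    d = _denominator(y, f, y_max, sigma)
    return y_max * sigma * (sigma + y_max) * (y_max - f) / d ** 2

# Illustrative values: wall height f = 0.15 under a domain of height 2.
Y_MAX, SIGMA, F = 2.0, 4.0, 0.15
eta_sample = eta_of_y(0.7, F, Y_MAX, SIGMA)
y_back = y_of_eta(eta_sample, F, Y_MAX, SIGMA)  # should reproduce 0.7
```

The wall $y = f(x)$ maps to $\eta = 0$ and the top $y = y_{\max}$ maps to $\eta = y_{\max}$, so the computational domain is a rectangle regardless of the roughness shape.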
Discretization


In the computational $(\xi, \eta, \zeta)$ space, the grids are uniform. Suppose u, v, w and U, V, W are
defined in terms of a staggered grid in the computational space (see Figure 3). Here, the values of P
are associated with its cell centers, u and U with centers of the cell surfaces parallel to the $(\eta, \zeta)$
plane, v and V with centers of the cell surfaces parallel to the $(\xi, \zeta)$ plane, and w and W with
centers of the cell surfaces parallel to the $(\xi, \eta)$ plane.

Second-order backward Euler differences are used in the time direction, and fourth-order central 
differences are used in space. Details of such an approach can be found in [5]. With those 
assumptions, we can write the discretized governing equations symbolically as follows: 


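$$\begin{aligned}
& A_{EE} u_{EE} + A_E u_E + A_W u_W + A_{WW} u_{WW} + A_{NN} u_{NN} + A_N u_N + A_S u_S + A_{SS} u_{SS} \\
& \quad + A_{FF} u_{FF} + A_F u_F + A_B u_B + A_{BB} u_{BB} - A_C u_C + D_{WW} P_{WW} + D_W P_W + D_E P_E - D_C P_C = S_u, \qquad (29) \\
& B_{EE} v_{EE} + B_E v_E + B_W v_W + B_{WW} v_{WW} + B_{NN} v_{NN} + B_N v_N + B_S v_S + B_{SS} v_{SS} \\
& \quad + B_{FF} v_{FF} + B_F v_F + B_B v_B + B_{BB} v_{BB} - B_C v_C + E_{SS} P_{SS} + E_S P_S + E_N P_N - E_C P_C = S_v, \qquad (30) \\
& C_{EE} w_{EE} + C_E w_E + C_W w_W + C_{WW} w_{WW} + C_{NN} w_{NN} + C_N w_N + C_S w_S + C_{SS} w_{SS} \\
& \quad + C_{FF} w_{FF} + C_F w_F + C_B w_B + C_{BB} w_{BB} - C_C w_C + F_{BB} P_{BB} + F_B P_B + F_F P_F - F_C P_C = S_w, \qquad (31) \\
& DU_{EE} U_{EE} + DU_E U_E + DU_W U_W - DU_C U_C + DV_{NN} V_{NN} + DV_N V_N \\
& \quad + DV_S V_S - DV_C V_C + DW_{FF} W_{FF} + DW_F W_F + DW_B W_B - DW_C W_C = S_m, \qquad (32) \\
& u_C = \eta_y^C U_C, \qquad (33) \\
& w_C = \eta_y^C W_C, \qquad (34) \\
& v_C = \eta_x^C U_C + V_C + \eta_z^C W_C. \qquad (35)
\end{aligned}$$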



Figure 3. Staggered grid structure in the computational $(\xi, \eta, \zeta)$ space.

The coefficients and source term for the interior points of the discrete $\xi$-momentum equation (29)
associated with $u_C$ are given as follows:
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VyC_ 
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12PeAC 2 + 12AC^ // + 
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351 ac 1 
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(36) 


Here, superscripts n and n — 1 are used to indicate values at previous time steps, and the 
superscript n + 1, which indicates the current time step, is dropped for convenience. Lower case 
subscripts denote the approximate values of the v and w at points where the associated values of u 
are located (Figure 4). Other symbols used in the above formulas are as follows: 


$$\alpha = \eta_x^2 + \eta_y^2 + \eta_z^2, \qquad \gamma = \eta_{xx} + \eta_{yy} + \eta_{zz}, \qquad u_\xi = \frac{\partial u}{\partial \xi}, \qquad u_\zeta = \frac{\partial u}{\partial \zeta}.$$


Figure 4. Neighbor points for the $\xi$-momentum equation
(U are at the same points as u and are not shown here).

All function values that are required at other than the canonical locations are obtained by 
fourth-order interpolation in the computational space. For example (see Figure 5), 

$$V_c = \bigl(9(V_C + V_N + V_{NW} + V_W) - (V_{SWW} + V_{NNWW} + V_{SE} + V_{NNE})\bigr)/32, \qquad (37)$$

$$P_s = \bigl(9(P_S + P_{SW}) - (P_{SE} + P_{SWW})\bigr)/16. \qquad (38)$$
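The weights in (37) and (38) reproduce any cubic polynomial exactly at the target point, which is what makes the interpolation fourth-order accurate. The following sketch checks this on a uniform grid; the function names and the test polynomials are illustrative assumptions.

```python
def interp_midpoint_1d(far_left, left, right, far_right):
    """Fourth-order midpoint interpolation with weights (-1, 9, 9, -1)/16, as in (38)."""
    return (9.0 * (left + right) - (far_left + far_right)) / 16.0

def interp_center_2d(inner, outer):
    """Fourth-order interpolation to a point at the center of four 'inner' values,
    corrected by the four diagonal 'outer' values, as in (37)."""
    return (9.0 * sum(inner) - sum(outer)) / 32.0

def cubic_1d(x):
    return 1.0 + 2.0 * x - x * x + 0.5 * x ** 3

def cubic_2d(x, y):
    return 1.0 + x - 2.0 * y + x * y + x * x - y * y + x ** 3 - x * y * y

h = 0.1  # illustrative mesh width
# 1-D: samples at -3h/2, -h/2, h/2, 3h/2, target at 0.
value_1d = interp_midpoint_1d(cubic_1d(-1.5 * h), cubic_1d(-0.5 * h),
                              cubic_1d(0.5 * h), cubic_1d(1.5 * h))
# 2-D: inner samples at (+-h/2, +-h/2), outer samples at (+-3h/2, +-3h/2).
corners = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
inner = [cubic_2d(0.5 * h * sx, 0.5 * h * sy) for sx, sy in corners]
outer = [cubic_2d(1.5 * h * sx, 1.5 * h * sy) for sx, sy in corners]
value_2d = interp_center_2d(inner, outer)
```

Both interpolated values agree with the exact value 1 of the test polynomials at the origin up to rounding.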

The coefficients for the $\eta$- and $\zeta$-momentum equations are defined in an analogous way; the
discrete continuity equation is developed simply by applying central differences to each term.

On the solid wall boundary points, we change the $\eta$-direction difference to second order, and
maintain fourth order in both the $\xi$- and $\zeta$-directions. For more details, see [6].



Figure 5. Neighbor points for fourth-order approximation for $V_c$ and $P_s$.

Line Distributive Relaxation


We use here the same basic line distributive relaxation method developed in our previous work 
(cf. [5]), with some modifications. Figure 6 shows the distribution of corrections for the group of 
variables that are located in the $(\xi, \eta)$ plane.

The process that we use for solving the discrete system (29)-(35) can be described as follows: 

• Freezing P, U, V, W, v, and w, perform point Gauss-Seidel relaxation on (29) over the entire
computational domain to obtain a new u.

• Freezing P, U, V, W, u, and w, perform point Gauss-Seidel relaxation on (30) over the entire
computational domain to obtain a new v.

• Freezing P, U, V, W, u, and v, perform point Gauss-Seidel relaxation on (31) over the entire
computational domain to obtain a new w.

• Use transformation (33)-(35) to obtain new U, V, W.

• For all $j = 2, 3, \ldots, n_j - 1$ at once, change $U_{ijk}$, $U_{i+1\,jk}$, $V_{ijk}$, $W_{ijk}$, $W_{ij\,k+1}$ to satisfy the
continuity equations, then update $P_{ijk}$ so that the new U, V, W and P as well as the
associated transformed u, v, w satisfy the three momentum equations.


Figure 6. Distribution of corrections in the $(\xi, \eta)$ plane.
Remark. Since all of the u, v, w have been previously relaxed, and the U, V, W are updated to
perform the latter corrections, we assume that equations (29)-(31) hold exactly. Let $\varepsilon$, $\delta$, $\sigma$ and $\gamma$
represent the corrections for U, V, W and P, respectively. Thus, for cube ijk, when we distribute
the corrections according to Figure 6, the correction equations corresponding to (29)-(31) are

T' “ + A% l r,^)e i - D% k lt = 0, (39) 

Bf(Sj ~ «,-«) - 1 - «,) - BgS, = 0, (40) 

(CgV'**' + )«i - ■f’SSj = 0, (41) 

(DU% k + DU g‘)t, + (ZW»* + DWf)a, + DV» k (S ) - S Hl ) - DV$ k (l - «,) = S«», (42) 

J = 2, 3, • • • , rij - 1. 

This system has $4(n_j - 2)$ equations and $4(n_j - 2)$ variables. Unfortunately, coupling between the
correction variables makes the problem somewhat complicated. To develop a simpler approximate
system, define


Uxj = 


v Z j = 



with fixed i and k. Then 

w = (Bf ± - B% k S J+1 - 

X3 Ef (A% k r b Ut+i Jk +A i S k V y 3k )S j 
u = (*%* + B% h )6j - B% k S j+1 - RffV i 

Ef {C% k rg'^ + &i k r%'’ k )6 j 

From (36), we see that A l jL k ~ for high Re and small At, which is much larger than A% k . 
Similarly, B'£ k ~ ~ and C]i k ~ jSt . These yield 


(43) 

(44) 


D J C _ A77 

Ebvv 3k A&FW* ’ 

F’c _ At, 

E ] c Vv’ Jk A CVv ,Jk Vv Jk ' 


(45) 

(46) 


With the above approximations, $\omega_{xj}$ and $\omega_{zj}$ can be treated as known parameters, so equation (44)
can be written in terms of the unknowns $\delta_j$ only:


{(DU‘i k + DU% k )u xj + (DVf + DVg k ) + (DW'i k + DW%*)w, t )S j 

-DVg'S,., - DV% k 6j+i = S‘l k . (47) 

Let 


a, = (DU¥ + DU i i k )u xl + (DVii k +DV% k )+(DW , £ k + DW i S k )<j, j , (48) 

b , = -DV” k , (49) 

Cj = -DVg k , (50) 

j = 2,3,---,n i - 1. 
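Then we obtain the tridiagonal system

$$\begin{pmatrix}
a_2 & b_2 & & & \\
c_3 & a_3 & b_3 & & \\
 & \ddots & \ddots & \ddots & \\
 & & c_{n_j-2} & a_{n_j-2} & b_{n_j-2} \\
 & & & c_{n_j-1} & a_{n_j-1}
\end{pmatrix}
\begin{pmatrix} \delta_2 \\ \delta_3 \\ \vdots \\ \delta_{n_j-2} \\ \delta_{n_j-1} \end{pmatrix}
=
\begin{pmatrix} S_m^{i2k} \\ S_m^{i3k} \\ \vdots \\ S_m^{i\,n_j-2\,k} \\ S_m^{i\,n_j-1\,k} \end{pmatrix}. \qquad (51)$$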


Thus, $\delta_j$, $j = 2, 3, \ldots, n_j - 1$, can be determined very efficiently. The other velocity corrections are
given by

$$\varepsilon_j = \omega_{xj}\,\delta_j, \qquad \sigma_j = \omega_{zj}\,\delta_j, \qquad j = 2, 3, \ldots, n_j - 1.$$
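A tridiagonal system of the form (51) can be solved in O(n) operations by the standard Thomas algorithm. The sketch below is a generic implementation, not the authors' code; the argument names follow the diagonal $a_j$, super-diagonal $b_j$ and sub-diagonal $c_j$ of (51), and the ratio lists `omega_x`, `omega_z` are hypothetical placeholders for the known parameters.

```python
def solve_tridiagonal(c, a, b, s):
    """Solve c[j]*x[j-1] + a[j]*x[j] + b[j]*x[j+1] = s[j] by the Thomas algorithm.

    c[0] and b[-1] are ignored. No pivoting is done, so the matrix is assumed
    to be diagonally dominant (or otherwise safe for elimination).
    """
    n = len(a)
    b_mod = [0.0] * n
    s_mod = [0.0] * n
    b_mod[0] = b[0] / a[0]
    s_mod[0] = s[0] / a[0]
    for j in range(1, n):                      # forward elimination
        denom = a[j] - c[j] * b_mod[j - 1]
        b_mod[j] = b[j] / denom if j < n - 1 else 0.0
        s_mod[j] = (s[j] - c[j] * s_mod[j - 1]) / denom
    x = [0.0] * n
    x[-1] = s_mod[-1]
    for j in range(n - 2, -1, -1):             # back substitution
        x[j] = s_mod[j] - b_mod[j] * x[j + 1]
    return x

def other_corrections(delta, omega_x, omega_z):
    """epsilon_j = omega_xj * delta_j and sigma_j = omega_zj * delta_j."""
    eps = [wx * d for wx, d in zip(omega_x, delta)]
    sig = [wz * d for wz, d in zip(omega_z, delta)]
    return eps, sig

# Small illustrative system with solution (1, 1, 1).
delta = solve_tridiagonal([0.0, -1.0, -1.0], [2.0, 2.0, 2.0],
                          [-1.0, -1.0, 0.0], [1.0, 0.0, 1.0])
eps, sig = other_corrections(delta, [0.5, 0.5, 0.5], [0.25, 0.25, 0.25])
```

Because one such solve handles a whole grid line at once, the cost of the distributive step stays proportional to the number of cells on the line.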


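The U, V, and W are then updated on all cells in the $(i, k)$ $y$-line as follows:

$$\begin{aligned}
U_{i+1\,jk} &\leftarrow U_{i+1\,jk} + \varepsilon_j, \\
U_{ijk} &\leftarrow U_{ijk} - \varepsilon_j, \\
W_{ij\,k+1} &\leftarrow W_{ij\,k+1} + \sigma_j, \\
W_{ijk} &\leftarrow W_{ijk} - \sigma_j, \qquad j = 2, 3, \ldots, n_j - 1, \\
V_{ijk} &\leftarrow V_{ijk} + \delta_{j-1} - \delta_j, \qquad j = 3, 4, \ldots, n_j - 1.
\end{aligned}$$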


Finally, the pressure corrections 7 j are determined as follows: 

^g*+Bg t +^Cg t f 

12 

+ Bf +£r:‘s\ 

TJ = + + ■ 

V + At, + 

j = 3, • * • , Tlj - 1. 

P is then updated via

$$P_{ijk} \leftarrow P_{ijk} + \gamma_j, \qquad j = 2, 3, \ldots, n_j - 1.$$


(52) 

(53) 


(54) 

(55) 


(56) 
COMPUTATIONAL RESULTS


We first tested the code by applying it to single and double roughness elements using a moderate 
sized grid. A 170 x 50 x 4 grid, including a five T-S wavelength physical domain and a one 
wavelength buffer domain, was used. Considering that Re = 5000 corresponds to a decreasing mode 
according to the linear stability theory, we use this Reynolds number in our code to investigate the 
effect of roughness elements. Fourier analysis is used to decompose different Fourier modes of the 
disturbance. For details about the Fourier transformation approach, see [8]. 

Let $\kappa_i = 0.15$ and $b_i = \sqrt{5}$. For the single roughness case, the peak point is at i = 30; for the
double roughness case, the peak points are at i = 42 and i = 70. The stretch parameter $\sigma$ is set to
4, and the amplitude of the disturbance $\varepsilon$ is set to $0.0025\sqrt{2}$. Figure 7 displays contours of the
perturbation streamfunctions, showing that the roughness elements make the disturbance increase
for a certain distance downstream. The results of Fourier transformation given in Figure 8 show
that the mean flow distortion and first and second harmonic waves are amplified over this distance.

To test the effect of multiple elements, a 402 x 66 x 4 grid (including a nine wavelength physical
domain and a one wavelength buffer domain) is used. Here we set $\kappa_i = 0.12$, $b_i = 2.0$, and $\sigma = 4.5$.
The first roughness element is at i = 82, which is two wavelengths from the inflow boundary. We placed
seven roughness elements in the computational domain, starting from i = 82 and spaced 40 grid
points apart. Figures 9 and 10 depict the contour plots of streamfunctions and vorticities, showing
very clearly that the disturbance is amplified after it passes each element.




Figure 7. Contours of perturbation streamfunctions with Re = 5000, $\kappa_i = 0.15$,
$\varepsilon = 0.0025\sqrt{2}$ and 170 x 50 x 4 grid. Flow direction is from left to right.
















Figure 8. Maximum amplitudes of fundamental wave $u_1$, $v_1$, mean-flow distortion $u_0$, $v_0$,
first harmonic wave $u_2$, $v_2$, and second harmonic wave $u_3$, $v_3$ for Re = 5000,
$\kappa_i = 0.15$, and $\varepsilon = 0.0025\sqrt{2}$ with two roughness elements (grid: 170 x 50 x 4).










Figure 9. Contour plots of perturbation streamfunctions and vorticities
for Re = 5000, $\kappa_i = 0.12$, and $\varepsilon = 0.0025\sqrt{2}$ with seven roughness
elements (grid: 402 x 66 x 4, flow direction is from left to right).




Figure 10. Contour plots of total streamfunctions and vorticities for
Re = 5000, $\kappa_i = 0.12$, and $\varepsilon = 0.0025\sqrt{2}$ with seven roughness
elements (grid: 402 x 66 x 4, flow direction is from left to right).



CONCLUDING REMARKS 


As expected on physical grounds, we find that the spatial growth rates of the disturbance
increase when surface roughness is present. Though our work is limited to roughness without
stagnation points in the computational domain, such a scope includes a rather large variety of real
roughness elements. Moreover, the code is very efficient, requiring about 2.68 seconds per time step
for the 402 x 66 x 4 grid case (equivalent to about 26 μs per time step per grid point).

SUMMARY

We consider the numerical solution of the single-group, steady state, isotropic transport
equation. An analysis by means of the moment equations shows that a discrete ordinates $S_N$
discretization in direction (angle) with a least squares finite element discretization in space does not
behave properly in the diffusion limit. A scaling of the $S_N$ equations is introduced so that the least
squares discretization has the correct diffusion limit. For the resulting discrete system a full
multigrid algorithm was developed.


1. INTRODUCTION 


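The single-group, steady state, isotropic form of the Boltzmann transport equation for
one-dimensional slab geometry is given by (Lewis and Miller [6])

$$\begin{aligned}
\mu \frac{\partial \psi}{\partial x}(x,\mu) + \sigma_t\, \psi(x,\mu) - \frac{\sigma_s}{2} \int_{-1}^{1} \psi(x,\mu')\, d\mu' &= q(x,\mu), \\
\psi(a,\mu) &= g_1(\mu) \quad \text{for } \mu > 0, \\
\psi(b,\mu) &= g_2(\mu) \quad \text{for } \mu < 0,
\end{aligned} \qquad (1.1)$$

where $x \in [a, b]$ and $\mu \in [-1, 1]$. When $\sigma_t \to \infty$ and $\sigma_s/\sigma_t \to 1$, which is the so-called diffusion limit,
this equation becomes singular. The limit operator $(I - P)$, where $P$ denotes the operator
$P\psi(x,\mu) = \frac{1}{2}\int_{-1}^{1} \psi(x,\mu')\, d\mu'$, has in its nullspace all functions that are independent of angle $\mu$.

Moreover, in this limit transport theory transitions into diffusion theory in the following way.

Let $\varepsilon$ be a small parameter. Substituting $\sigma_t$ by $\frac{1}{\varepsilon}$, $\sigma_s$ by $(\frac{1}{\varepsilon} - \varepsilon\sigma_a)$, where $\sigma_a$ is $O(1)$, and scaling the
right-hand side by $\varepsilon$, equation (1.1) becomes

$$\mu \frac{\partial \psi}{\partial x} + \frac{1}{\varepsilon}\,\psi - \Bigl(\frac{1}{\varepsilon} - \varepsilon\sigma_a\Bigr) P\psi = \varepsilon q. \qquad (1.2)$$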

In addition, it is assumed that the external source $q$ is independent of $\mu$. As a consequence of this
parameterization the diffusion limit is now equivalent to the limit $\varepsilon \to 0$. By expanding the solution
of (1.2) as

$$\psi(x,\mu) = \psi^0(x,\mu) + \sum_{j=1}^{\infty} \varepsilon^j \psi^j(x,\mu)$$

it can be shown (Larsen [2] [3], Larsen, Morel and Miller [4]) that some mean-free-paths away
from the boundary the zeroth order term, $\psi^0(x,\mu)$, is independent of $\mu$ and is a solution of the
following diffusion equation

$$-\frac{1}{3}\frac{d^2\phi}{dx^2}(x) + \sigma_a \phi(x) = q(x). \qquad (1.3)$$

Thus, in the diffusion limit the solution of the transport equation will converge to the solution of a
diffusion equation.

For the numerical solution of (1.1) it is important to find a discretization that has the same
property; i.e., for diffusive regimes ($\sigma_t$ large, $\sigma_a/\sigma_t \ll 1$) the difference scheme for the transport
equation must approximate a diffusion operator.

In the last two decades a large amount of work was dedicated to developing special 
discretizations for the transport equation that have the correct behavior in the diffusion limit. 
Among them are the Diamond Difference scheme [6], the Linear Discontinuous scheme [1], and the 
Modified Linear Discontinuous scheme [5]. In the one-dimensional case their implementation is 
straightforward, but their extension to higher dimensions is difficult. 

In this paper we try to develop a general framework for finding discretizations for the transport
equation that have the correct behavior in the diffusion limit. In Section 2 we describe the discrete
ordinates $S_N$ discretization in angle and a least squares finite element discretization in space and
discuss why this simple approach does not behave properly in the diffusion limit. A scaling
technique for the transport equation is introduced in Section 3 that yields a least squares
discretization with the proper diffusion limit. In Section 4 we present numerical results based on a
full multigrid solver for the resulting discrete system. In Section 5 we draw conclusions and suggest
further applications of the scaling technique.

2. DISCRETIZATION 


For the discretization in angle we use the standard discrete ordinates $S_N$ method. In the case of
one-dimensional slab geometry, this is a Galerkin discretization with normalized Legendre
polynomials as basis. That means we are looking for a flux solution that has an expansion in the
first N normalized Legendre polynomials,

$$\psi(x,\mu) = \sum_{i=0}^{N-1} \phi_i(x)\, p_i(\mu). \qquad (2.1)$$

Since the normalized Legendre polynomials form an orthonormal basis for
$L^2([-1, 1])$, the moment coefficients $\phi_i$ are given by the following integral, which can be
written exactly as a sum by using a Gauss quadrature rule. We have


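$$\phi_i(x) = \int_{-1}^{1} \psi(x,\mu)\, p_i(\mu)\, d\mu = \sum_{j=1}^{N} \omega_j\, \psi(x,\mu_j)\, p_i(\mu_j). \qquad (2.2)$$

Here $\mu_j$ denotes the Gauss quadrature points and $\omega_j$ denotes the Gauss quadrature weights.
By introducing the vector notation

$$\Psi = (\psi(x,\mu_1), \ldots, \psi(x,\mu_N))^t; \qquad \Phi = (\phi_0(x), \ldots, \phi_{N-1}(x))^t$$

and defining matrices $T$ and $\Omega$ as

$$[T]_{i,j} = p_{i-1}(\mu_j); \qquad \Omega = \mathrm{diag}(\omega_1, \ldots, \omega_N)$$

the relationship (2.1) and (2.2) between the flux $\Psi$ and the moments $\Phi$ can be written as

$$\Phi = T\Omega\Psi, \qquad (2.3)$$

$$\Psi = T^t\Phi. \qquad (2.4)$$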
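As a result of the Galerkin discretization of (1.1) with the ansatz (2.1) we obtain the $S_N$
equations

$$L\Psi = \varepsilon\, \mathrm{diag}(\mu_1, \ldots, \mu_N)\, \frac{\partial \Psi}{\partial x} + \Psi - (1 - \varepsilon^2\sigma_a) R\Psi = \varepsilon^2 \underline{q}, \qquad (2.5)$$

where

$$R = (1, \ldots, 1)^T (\omega_1, \ldots, \omega_N).$$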

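When we insert (2.4) into (2.5) and multiply by $T\Omega$ from the left, we get the moment equations

$$M\Phi = \begin{pmatrix}
\varepsilon^2\sigma_a & \varepsilon b_0 \frac{\partial}{\partial x} & 0 & \cdots & 0 \\
\varepsilon b_0 \frac{\partial}{\partial x} & 1 & \varepsilon b_1 \frac{\partial}{\partial x} & & \vdots \\
0 & \varepsilon b_1 \frac{\partial}{\partial x} & 1 & \ddots & \\
\vdots & & \ddots & \ddots & \varepsilon b_{N-2} \frac{\partial}{\partial x} \\
0 & \cdots & & \varepsilon b_{N-2} \frac{\partial}{\partial x} & 1
\end{pmatrix} \Phi = \varepsilon^2 \underline{q} \qquad (2.6)$$

with

$$b_j = \frac{j + 1}{\sqrt{4(j + 1)^2 - 1}}.$$

Normally, the computations are done in the flux representation (2.5) since in this representation
the boundary conditions are equal to simple Dirichlet boundary conditions. However, as we will see
later, the moment equations are very useful for theoretical insight. In the following the flux
operator is denoted by $L$ and the moment operator by $M$.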
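As a short worked check that the moment equations contain the diffusion equation, consider N = 2 and assume that the right-hand side of the moment system has only a zeroth component, $\underline{q} = (q_0, 0)^t$ (an assumption consistent with a source that is independent of $\mu$). Eliminating $\phi_1$ from the two rows gives:

```latex
\begin{aligned}
\varepsilon^2 \sigma_a \phi_0 + \varepsilon b_0 \frac{d\phi_1}{dx} &= \varepsilon^2 q_0,
\qquad
\varepsilon b_0 \frac{d\phi_0}{dx} + \phi_1 = 0
\;\;\Longrightarrow\;\;
\phi_1 = -\varepsilon b_0 \frac{d\phi_0}{dx}, \\
\varepsilon^2 \sigma_a \phi_0 - \varepsilon^2 b_0^2 \frac{d^2\phi_0}{dx^2} &= \varepsilon^2 q_0
\;\;\Longrightarrow\;\;
-\frac{1}{3}\frac{d^2\phi_0}{dx^2} + \sigma_a \phi_0 = q_0,
\qquad b_0^2 = \frac{1}{3}.
\end{aligned}
```

The result has the same form as (1.3), so any defect in the diffusion limit must come from the spatial discretization rather than from the angular one.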
For the spatial discretization of the $S_N$ equations we use a least squares finite-element method
based on the functional

$$F(\Psi) = \int_a^b \bigl\langle \Omega\,(L\Psi - \underline{q}),\; L\Psi - \underline{q} \bigr\rangle_{\mathbb{R}^N}\, dx, \qquad (2.7)$$

and piecewise linear continuous elements $\eta_k$ as basis functions for each component of $\Psi$. In the
following $S^h$ denotes the finite dimensional space $S^h = \mathrm{span}\{\eta_k\}$ and $(S^h)^N$ denotes the space of
N-tuples whose elements are in $S^h$.

The advantage of this approach is that a least squares discretization converts the Sn equations, 
which are a coupled system of first order equations, into a self-adjoint variational formulation. 
Based on this variational formulation Multi-Level Projection Methods [7] can be applied in order to 
guide the development of a multigrid solver for the resulting discrete system. 

Unfortunately, this discretization does not behave correctly in the diffusion limit. In order to
explain this fact, we use the moment equations, since $\phi_0 = \psi$ in the diffusion limit. Further, it is
easy to see by using the relationship (2.3), (2.4) and the identity $T^t T \Omega = I$ that

$$\min_{\Psi \in (S^h)^N} \int_a^b \bigl\langle \Omega\,(L\Psi - \underline{q}),\; L\Psi - \underline{q} \bigr\rangle_{\mathbb{R}^N}\, dx
\;\Longleftrightarrow\;
\min_{\Phi \in (S^h)^N} \int_a^b \bigl\langle M\Phi - \underline{q},\; M\Phi - \underline{q} \bigr\rangle_{\mathbb{R}^N}\, dx,$$

which justifies why it is also possible to look at the least squares discretization of the moment
equations.

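In the $S_2$ case with $\sigma_a = 0$, for example, a least squares discretization of the moment equations
results in the following discrete system

$$\begin{pmatrix}
\frac{\varepsilon^2}{3}(\eta',\eta') & \frac{\varepsilon}{\sqrt{3}}(\eta',\eta) \\
\frac{\varepsilon}{\sqrt{3}}(\eta,\eta') & (\eta,\eta) + \frac{\varepsilon^2}{3}(\eta',\eta')
\end{pmatrix}
\begin{pmatrix} \phi_0 \\ \phi_1 \end{pmatrix}
=
\begin{pmatrix} 0 \\ \frac{\varepsilon^3}{\sqrt{3}}(\eta', q_0) \end{pmatrix}, \qquad (2.8)$$

where, for example, $(\eta, \eta)$ is a mass matrix and $(\eta', \eta')$ a stiffness matrix with elements

$$[(\eta,\eta)]_{k,l} = \int_a^b \eta_k(x)\,\eta_l(x)\, dx, \qquad [(\eta',\eta')]_{k,l} = \int_a^b \frac{d\eta_k}{dx}\,\frac{d\eta_l}{dx}\, dx.$$

For sufficiently small $\varepsilon$ we have

$$\Bigl[(\eta,\eta) + \frac{\varepsilon^2}{3}(\eta',\eta')\Bigr]^{-1} = (\eta,\eta)^{-1} - \frac{\varepsilon^2}{3}(\eta,\eta)^{-1}(\eta',\eta')(\eta,\eta)^{-1} + O(\varepsilon^4). \qquad (2.9)$$

Plugging (2.9) into (2.8) and dividing by $\varepsilon^2$ leads to

$$\begin{aligned}
\Bigl\{ \underbrace{\tfrac{1}{3}\bigl[(\eta',\eta') - (\eta',\eta)(\eta,\eta)^{-1}(\eta,\eta')\bigr]}_{(*1)}
&+ \frac{\varepsilon^2}{9}(\eta',\eta)(\eta,\eta)^{-1}(\eta',\eta')(\eta,\eta)^{-1}(\eta,\eta') + O(\varepsilon^4) \Bigr\}\, \phi_0 \\
&= -\frac{\varepsilon^2}{3}(\eta',\eta)(\eta,\eta)^{-1}(\eta', q_0) + O(\varepsilon^4).
\end{aligned} \qquad (2.10)$$

In the limit $\varepsilon \to 0$, the solution approaches a valid discretization for the diffusion equation (1.3)
only if the term (*1) vanishes identically. For piecewise linear, continuous basis elements this term
does not cancel out and becomes the leading term in the equation. Consequently, in the diffusion
limit (2.10) is an approximation for $\phi_0'' = 0$, which results in a linear solution connecting the
boundary conditions. In general, the term (*1) does not have the proper behavior unless the mass
matrix, $(\eta, \eta)$, is lumped, that is, replaced by a diagonal matrix.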

3. SCALING 


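A closer look at the moment equations (2.6) shows that this system is unbalanced. There are
$O(\varepsilon^2)$, $O(\varepsilon)$ entries as well as $O(1)$ entries. The idea is to scale this system before the discretization.

First, let us consider the case $\sigma_a \neq 0$. In our inner product the adjoint moment operator, when
homogeneous boundary conditions are assumed, is given by

$$M^* = \begin{pmatrix}
\varepsilon^2\sigma_a & -\varepsilon b_0 \frac{\partial}{\partial x} & 0 & \cdots \\
-\varepsilon b_0 \frac{\partial}{\partial x} & 1 & -\varepsilon b_1 \frac{\partial}{\partial x} & \\
0 & -\varepsilon b_1 \frac{\partial}{\partial x} & 1 & \ddots \\
\vdots & & \ddots & \ddots
\end{pmatrix}.$$

Scaling the moment equations by

$$S = \mathrm{diag}\Bigl(\frac{1}{\varepsilon^2\sigma_a},\, 1,\, \ldots,\, 1\Bigr) \qquad (3.1)$$

and forming the normal equations results in

$$M^* S M \Phi = \varepsilon^2 M^* S\, \underline{q}. \qquad (3.2)$$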
Note that (3.2) already has the correct limit equations for the moments on its diagonal and all
first-order derivatives are eliminated.

Applying a Galerkin discretization to (3.2) and forming the Schur complement leads after
division by $\varepsilon^2$ to the following discrete equation for $\phi_0$:

$$\Bigl\{ \sigma_a(\eta,\eta) + \frac{1}{3}(\eta',\eta') + O(\varepsilon^2) \Bigr\}\, \phi_0 = (\eta, q_0) + O(\varepsilon^2),$$

which is a valid discretization of the corresponding diffusion equation (1.3) in the limit $\varepsilon \to 0$.

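When we define $b_{k,j} = e_j \eta_k(x)$, where $e_j$ denotes the j-th canonical unit vector of $\mathbb{R}^N$, we can write
the Galerkin discretization of (3.2) as follows

$$\forall\, b_{k,j}: \quad \int_a^b \bigl\langle M^* S\,(M\Phi - \underline{q}),\; b_{k,j} \bigr\rangle_{\mathbb{R}^N}\, dx = 0. \qquad (3.3)$$

Assuming homogeneous boundary conditions and splitting $S = \sqrt{S}\sqrt{S}$, (3.3) is equivalent to

$$\forall\, b_{k,j}: \quad \int_a^b \bigl\langle \sqrt{S}\,(M\Phi - \underline{q}),\; \sqrt{S}\, M b_{k,j} \bigr\rangle_{\mathbb{R}^N}\, dx = 0
\;\Longleftrightarrow\;
\min_{\Phi \in (S^h)^N} \int_a^b \bigl\langle \sqrt{S}\,(M\Phi - \underline{q}),\; \sqrt{S}\,(M\Phi - \underline{q}) \bigr\rangle_{\mathbb{R}^N}\, dx,$$

which is a least squares discretization of the moment equations, scaled by $\sqrt{S}$. Consequently, a
least squares discretization of the moment equations, scaled by $\sqrt{S}$, also has the correct behavior in
the diffusion limit.

Notice that (3.2) has the proper behavior for any $\sigma_a \neq 0$. The second equation contains $\frac{1}{\sigma_a}$ on
both sides and yields the proper solution for $\phi_1$ as $\sigma_a \to 0$.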
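On the other hand, if $\sigma_a = 0$ the scaling (3.1) cannot be applied. In this case, scaling the
moment equations by

$$S_0 = \mathrm{diag}(1,\, \alpha,\, \ldots,\, \alpha) \qquad (3.4)$$

with

$$\alpha = \varepsilon^p, \quad \text{where } p > 2, \qquad (3.5)$$

and forming the normal equations results in

$$\begin{pmatrix}
-\varepsilon^2\alpha b_0^2 \frac{\partial^2}{\partial x^2} & -\varepsilon\alpha b_0 \frac{\partial}{\partial x} & -\varepsilon^2\alpha b_0 b_1 \frac{\partial^2}{\partial x^2} & 0 & \cdots \\
\varepsilon\alpha b_0 \frac{\partial}{\partial x} & \alpha - \varepsilon^2(b_0^2 + \alpha b_1^2) \frac{\partial^2}{\partial x^2} & 0 & -\varepsilon^2\alpha b_1 b_2 \frac{\partial^2}{\partial x^2} & \cdots \\
-\varepsilon^2\alpha b_0 b_1 \frac{\partial^2}{\partial x^2} & 0 & \alpha - \varepsilon^2\alpha(b_1^2 + b_2^2) \frac{\partial^2}{\partial x^2} & 0 & \cdots \\
\vdots & & & \ddots &
\end{pmatrix} \Phi
=
\begin{pmatrix} 0 \\ -\varepsilon^3 b_0 \frac{\partial q_0}{\partial x} \\ 0 \\ \vdots \end{pmatrix}. \qquad (3.6)$$

A Galerkin discretization of (3.6) leads to the following discrete system

$$\begin{pmatrix}
\varepsilon^2\alpha b_0^2(\eta',\eta') & A_{01} \\
A_{10} & A_{11}
\end{pmatrix} \Phi_h
=
\begin{pmatrix} 0 \\ \varepsilon^3 b_0 (\eta', q_0) \\ 0 \\ \vdots \end{pmatrix}, \qquad (3.7)$$

where

$$A_{01} = \bigl(-\varepsilon\alpha b_0(\eta,\eta'),\; \varepsilon^2\alpha b_0 b_1(\eta',\eta'),\; 0,\; \ldots\bigr),$$

$$A_{10} = \bigl(\varepsilon\alpha b_0(\eta,\eta'),\; \varepsilon^2\alpha b_0 b_1(\eta',\eta'),\; 0,\; \ldots\bigr)^t,$$

and

$$A_{11} = \begin{pmatrix}
\alpha(\eta,\eta) + \varepsilon^2(b_0^2 + \alpha b_1^2)(\eta',\eta') & 0 & \varepsilon^2\alpha b_1 b_2(\eta',\eta') & 0 & \cdots \\
0 & \alpha(\eta,\eta) + \varepsilon^2\alpha(b_1^2 + b_2^2)(\eta',\eta') & 0 & \varepsilon^2\alpha b_2 b_3(\eta',\eta') & \cdots \\
\varepsilon^2\alpha b_1 b_2(\eta',\eta') & 0 & \alpha(\eta,\eta) + \varepsilon^2\alpha(b_2^2 + b_3^2)(\eta',\eta') & 0 & \cdots \\
\vdots & & & \ddots &
\end{pmatrix}.$$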



For sufficiently small $\varepsilon$ we have


1 


( MW 


4-1 - — 

A U ~ £2 


\ 


^(f) 


a 


+ Qi + Q2 + 0(-t), 




with Q j = O(^) and Q 2 — O(^). Using this expansion when forming the Schur complement in 
(3.7) we get 


} 0o = — AqiA^ 1 

( £ 3 b 0 {r}'.,q ( ) ) \ 
0 


/ 


which after some algebra becomes 

{£ 2 abl(r}',T]') - a 2 (r},ri')(T)',r) , )- 1 (Ti,r}') + 0(e 4 c*)} 4> 

= -e 2 a(7?,7/')(»7 , ) T7 , )' 1 (^.9o) + 0(a 2 ) * 


(3.8) 


Because of (3.5) we have $\alpha/\varepsilon^2 \to 0$, so that (3.8) is a valid discretization of the corresponding
diffusion equation in the diffusion limit. Using the same argument as above it follows that the least
squares discretization of the moment equations, scaled by $\sqrt{S_0}$, also has the correct behavior in the
diffusion limit.

As mentioned before, the computations are done in the flux representation. Therefore, we need
to transfer the scaling of the moment equations to a scaling for the $S_N$ equations. By means of the
relationship (2.3) and (2.4) it follows that a least squares discretization of the moment equations,
scaled by $\sqrt{S}$, is equivalent to a least squares discretization of the $S_N$ equations, scaled by

$$T^t \sqrt{S}\, T\Omega = \frac{1}{\varepsilon\sqrt{\sigma_a}}\, R + (I - R).$$

A further multiplication by $\varepsilon\sqrt{\sigma_a}$ leads to the following scaling in the case $\sigma_a \neq 0$:

$$R + \varepsilon\sqrt{\sigma_a}\,(I - R). \qquad (3.9)$$

Similarly, in the case $\sigma_a = 0$ the least squares discretization of the moment equations, scaled by
$\sqrt{S_0}$ with $p = 4$, is equivalent to a least squares discretization of the $S_N$ equations, scaled by

$$R + \varepsilon^2 (I - R). \qquad (3.10)$$

In order to avoid an "if else" in the computations it is possible to combine the scalings (3.9) and
(3.10) to

$$R + (\varepsilon\sqrt{\sigma_a} + \varepsilon^2)(I - R). \qquad (3.11)$$
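A minimal sketch of how a combined scaling of the form $R + c\,(I - R)$ with $c = \varepsilon\sqrt{\sigma_a} + \varepsilon^2$ could be assembled is given below. It assumes quadrature weights normalized to sum to 1, so that $R = (1, \ldots, 1)^T(\omega_1, \ldots, \omega_N)$ is a projection; the function names are hypothetical.

```python
import math

def projection_R(weights):
    """R = (1,...,1)^T (w_1,...,w_N): every row equals the weight vector."""
    n = len(weights)
    return [[weights[j] for j in range(n)] for _ in range(n)]

def flux_scaling(weights, eps, sigma_a):
    """Combined scaling R + (eps*sqrt(sigma_a) + eps^2) * (I - R)."""
    n = len(weights)
    c = eps * math.sqrt(sigma_a) + eps ** 2
    r = projection_R(weights)
    return [[r[i][j] + c * ((1.0 if i == j else 0.0) - r[i][j])
             for j in range(n)] for i in range(n)]

def apply(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Illustrative S_2 weights (normalized to sum to 1).
weights = [0.5, 0.5]
scale_no_absorption = flux_scaling(weights, 0.01, 0.0)    # reduces to R + eps^2 (I - R)
scale_with_absorption = flux_scaling(weights, 0.01, 1.0)
```

The angle-independent part of a flux vector passes through unchanged, while the angle-dependent part is damped by the small factor $c$; this single expression covers both $\sigma_a = 0$ and $\sigma_a \neq 0$ without branching.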


4. NUMERICAL RESULTS 


For the solution of the discrete system that results from a least squares discretization of the $S_N$
equations scaled by (3.11), a full multigrid in space algorithm with

• standard coarsening in space by doubling the mesh width,

• $\mu$-line red-black smoothing,

• full weighting,

• linear interpolation,

was developed. The full multigrid process starts by solving the problem on the coarsest level and
uses this solution as a starting guess for the next finer level, where a single V-cycle is performed.
Recursively, the solution process proceeds from coarser levels to finer levels by halving each grid
cell, using the coarse level solution as a starting guess and performing a single V-cycle on the next
finer mesh. This algorithm yields V-cycle convergence rates that are below 0.09. Therefore, by
performing one full multigrid V-cycle, a solution with an error on the order of the truncation error
is obtained (cf. [7]).
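The full multigrid control flow described above can be sketched on a much simpler model problem. The code below is not the transport solver: it applies the same ingredients (coarsening by doubling the mesh width, red-black smoothing, full weighting, linear interpolation, one V-cycle per level) to the scalar problem $-u'' = f$ on [0, 1] with zero boundary values, chosen here purely for illustration.

```python
import math

def smooth(u, f, h, sweeps=2):
    """Red-black Gauss-Seidel for (2u_i - u_{i-1} - u_{i+1})/h^2 = f_i."""
    n = len(u) - 1
    for _ in range(sweeps):
        for start in (1, 2):                       # odd points, then even points
            for i in range(start, n, 2):
                u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])

def residual(u, f, h):
    n = len(u) - 1
    r = [0.0] * (n + 1)
    for i in range(1, n):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r

def restrict(r):
    """Full weighting onto the grid with doubled mesh width."""
    nc = (len(r) - 1) // 2
    rc = [0.0] * (nc + 1)
    for i in range(1, nc):
        rc[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1]
    return rc

def prolong(ec):
    """Linear interpolation onto the grid with halved mesh width."""
    nc = len(ec) - 1
    e = [0.0] * (2 * nc + 1)
    for i in range(nc + 1):
        e[2 * i] = ec[i]
    for i in range(nc):
        e[2 * i + 1] = 0.5 * (ec[i] + ec[i + 1])
    return e

def v_cycle(u, f, h):
    n = len(u) - 1
    if n == 2:                                     # coarsest grid: solve exactly
        u[1] = 0.5 * (u[0] + u[2] + h * h * f[1])
        return u
    smooth(u, f, h)
    coarse_error = v_cycle([0.0] * (n // 2 + 1), restrict(residual(u, f, h)), 2.0 * h)
    correction = prolong(coarse_error)
    for i in range(1, n):
        u[i] += correction[i]
    smooth(u, f, h)
    return u

def full_multigrid(f_func, n_finest):
    """One V-cycle per level, starting from the coarsest grid (n_finest a power of 2)."""
    n, h = 2, 0.5
    u = [0.0, 0.5 * h * h * f_func(h), 0.0]
    while n < n_finest:
        n, h = 2 * n, 0.5 * h
        u = prolong(u)                             # coarse solution as starting guess
        v_cycle(u, [f_func(i * h) for i in range(n + 1)], h)
    return u

u_fmg = full_multigrid(lambda x: math.pi ** 2 * math.sin(math.pi * x), 64)
max_err = max(abs(u_fmg[i] - math.sin(math.pi * i / 64.0)) for i in range(65))
```

For this model problem the remaining error after the single pass is comparable to the discretization error of the finest grid, which is the property the text relies on.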

As a test problem we used the same problem that was used by Larsen, Morel and Miller in [4],
which is shown below:

$$\begin{aligned}
\mu_j \frac{\partial \psi_j}{\partial x} + 100\, \psi_j - 100 \sum_{k=1}^{N} \omega_k \psi_k &= 0.01, \\
\psi_j(0) &= 0 \quad \text{for } \mu_j > 0, \\
\psi_j(10) &= 0 \quad \text{for } \mu_j < 0.
\end{aligned} \qquad (4.1)$$

In our parametrization (1.2) this implies $\varepsilon = 0.01$, $\sigma_a = 0$ and $q = 1.0$. The exact solution of the
corresponding diffusion equation is

$$\phi(x) = -\frac{3}{2}x^2 + 15x,$$

which is plotted in Figure 4.1 as a solid line. The least squares solution of the scaled $S_N$ equations,
computed by one full multigrid V-cycle, is shown in Figure 4.1 by the crosses. We see that this is a
very satisfactory result, especially when we take into consideration that we used a mesh size of 1.25,
which is much larger than $\varepsilon$.

Finally, we mention that the least squares discretization of the $S_N$ equations without scaling will
give the zero solution for problem (4.1), indicated by the stars in Figure 4.1.

For the sake of completeness we present in Figure 4.2 the results for the test problem

$$\begin{aligned}
\mu_j \frac{\partial \psi_j}{\partial x} + 100\, \psi_j - 99.99 \sum_{k=1}^{N} \omega_k \psi_k &= 0.01\, q, \\
\psi_j(0) &= 0 \quad \text{for } \mu_j > 0, \\
\psi_j(10) &= 0 \quad \text{for } \mu_j < 0,
\end{aligned} \qquad (4.2)$$

where $\sigma_a = 1.0$, $\varepsilon = 0.01$, $q = 1 - \frac{3}{2}x^2 + 15x$. The exact solution of the corresponding diffusion
equation is the same as for problem (4.1) and is again plotted in Figure 4.2 as a solid line. The least
squares solution, computed by one full multigrid V-cycle, is given in Figure 4.2 by the crosses and the
solution for the least squares discretization of the $S_N$ equations without scaling is given by the stars.

Figure 4.2: Scalar flux solution of problem (4.2) (mesh size h = 1.25).
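The quoted reference solution can be checked directly against a diffusion equation of the form $-\frac{1}{3}\phi'' + \sigma_a\phi = q$ for both parameter sets. The short sketch below does this with a central difference, which is exact for a quadratic up to rounding; the helper names are illustrative.

```python
def phi(x):
    """Reference diffusion solution phi(x) = -(3/2) x^2 + 15 x."""
    return -1.5 * x * x + 15.0 * x

def diffusion_residual(x, sigma_a, q, h=1e-2):
    """Residual of -(1/3) phi'' + sigma_a * phi - q at the point x."""
    second_derivative = (phi(x - h) - 2.0 * phi(x) + phi(x + h)) / (h * h)
    return -second_derivative / 3.0 + sigma_a * phi(x) - q(x)

def q_first(x):
    """Source of the first test problem (sigma_a = 0)."""
    return 1.0

def q_second(x):
    """Source of the second test problem (sigma_a = 1)."""
    return 1.0 - 1.5 * x * x + 15.0 * x

sample_points = [1.25 * i for i in range(1, 8)]   # interior nodes for h = 1.25
worst_first = max(abs(diffusion_residual(x, 0.0, q_first)) for x in sample_points)
worst_second = max(abs(diffusion_residual(x, 1.0, q_second)) for x in sample_points)
```

Both residuals vanish to rounding accuracy, and $\phi(0) = \phi(10) = 0$, so the same parabola serves as the reference curve in both figures.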


5. CONCLUSIONS 

From the analysis in Section 3 and the numerical results presented in Section 4, we conclude that
the least squares discretization of the scaled $S_N$ equations has the proper behavior in the diffusion
limit.

Further, we point out that the scaling can be used in the case of nonhomogeneous material,
where $\sigma_t$ or, equivalently, $\varepsilon$ are discontinuous, because the equations are only scaled from the left
side so that no derivatives are applied to the scaling operator.

Adaptive refinement can be combined with the full multigrid solver in a natural way. Areas of
new refinement can be identified by examining the difference in the solution for two consecutive
grids. This is especially important for nonhomogeneous material, where interior layers may exist.

Numerical results show that with a slightly different scaling both a Galerkin finite element
formulation with piecewise linear elements and an Upwind Difference discretization of the $S_N$
equations also have the correct diffusion limit. We believe that this scaling approach will result in a
general framework for the development of discretizations that possess the correct diffusion limit.

Finally, we hope to apply the scaling techniques developed here to higher dimensional problems. 
Abstract

We study the potential performance of multigrid algorithms running on massively parallel
computers with the intent of discovering whether presently envisioned machines will provide an
efficient platform for such algorithms. We consider the domain parallel version of the standard
V-cycle algorithm on model problems, discretized using finite difference techniques in two and three
dimensions on block structured grids of size 10^6 and 10^9, respectively. Our models of parallel
computation were developed to reflect the computing characteristics of the current generation of
massively parallel multicomputers. These models are based on an interconnection network of 256 to
16,384 message passing, "workstation size" processors executing in an SPMD mode. The first model
accomplishes interprocessor communications through a multistage permutation network. The
communication cost is a logarithmic function which is similar to the costs in a variety of different
topologies. The second model allows single stage communication costs only. Both models were designed
with information provided by machine developers and utilize implementation derived parameters.
With the medium grain parallelism of the current generation and the high fixed cost of an
interprocessor communication, our analysis suggests that an efficient implementation requires the machine to
support the efficient transmission of long messages (up to 1000 words), or the high initiation cost of a
communication must be significantly reduced through an alternative optimization technique.
Furthermore, with variable length message capability, our analysis suggests that the low diameter
multistage networks provide little or no advantage over a simple single stage communications network.

Research at Princeton University partially supported by the National Science Foundation, Grant No.
CCR-8920505, the Office of Naval Research, Contract No. N0014-91-J-1463, and by DIMACS (Center
for Discrete Mathematics and Theoretical Computer Science), a National Science and Technology
Center, Grant No. NSF-STC88-09648.
0. Introduction

In the current generation of massively parallel (MP) computers there is a
convergence towards a common set of architectural characteristics. From the
standpoint of a computational scientist, this convergence presents the opportunity
to study the class of machines as a whole, in order to determine whether or not they
can be efficient platforms for the solution of various computationally intensive
tasks.


We studied the potential use of these machines for the solution of multigrid 
algorithms. Our study included a wide range of multigrid algorithms and 
encompassed several different architectural characteristics. 

In this paper we present the architectural ideas suggested by this study which 
would enable the current generation of MP machines to become efficient platforms 
for various multigrid applications. 

Our approach was to develop a set of models of parallel computation based on 
the common characteristics of the current generation of MP machines. We 
implemented a representative set of structured multigrid algorithms on these 
models. We then looked at the performance predictions and tried to understand 
their implications. 

The remainder of the paper is organized as follows. First, the models of 
computation are developed, followed by a brief description of the multigrid 
algorithms and their implementations. Next, the performance predictions are 
presented and finally their implications are summarized. 

1. The Current Generation of Massively Parallel Computers 

The power and availability of RISC microprocessor chips have increased 
dramatically over the past several years. The proliferation and decreased cost of 
these "workstation-size" processors have spawned the current generation of 
multicomputers. Some of the major architectural similarities of this generation are 
summarized below. 


Multicomputers These multicomputers are interconnection 
networks of physically distributed processors and memory, linked in a 
variety of different topological configurations. 

Powerful Microprocessors The processors are generally "off the 
shelf" single chip RISC microprocessors. They can perform integer and 
floating point computation significantly faster than the bit-serial 
processors which characterized many machines of the previous 
generation. 


Medium Grain Size The increased size, cost and speed of the 
individual processing elements has delineated a medium grain size for 
the current generation. Most of the machines are targeted for the range 
of 1K processors, with larger machines possibly ranging up to 16K 
processors. 

Slow Network Communication The current machines generally 
exhibit slow interprocessor communication speeds relative to on-chip 
events. This is frequently a result of handling the network 
communications processing in the software layer. 

Single Program Multiple Data Mode of Execution Unlike the 
more rigid SIMD and asynchronous MIMD patterns of the previous 
generation, most of the newer machines execute the same program on 
each processing element with different data, enforcing synchronization 
only as required by interprocessor communication. 

The current generation includes the CM5 by Thinking Machines, a network 
of Sun SPARC processor nodes, potentially with vector accelerators, connected in a 
fat tree topology; the Touchstone Delta, developed by Intel and Caltech, a three 
dimensional mesh of two Intel i860s per node; the Paragon by Intel, a 3D mesh 
topology with one to four i860 processors per node; and the Kendall Square Research 
machines, a hierarchy of concentric rings with shared virtual memory, with two 
custom designed chips per node. Cray Research is building a machine with DEC 
Alpha processors connected by an as yet unrevealed topology. 


2. Models of Parallel Computation 

The models of parallel computation presented in this paper were designed to 
capture the salient characteristics of the current generation of massively parallel 
computers. The guiding philosophy behind the development of these models was 
to strike a reasonable balance between machine independence and practicality, 
simplicity and accuracy. The goal is to find a set of models which facilitates efficient 
algorithm design, and ideally, provides feedback into the machine design process 
itself. 


The models of computation reflect the paradigm of the multicomputer: 
processors and memory are physically distributed throughout an interconnection 
network. Motivated by the large disparity between the speeds of on-chip and 
network events, the models reflect the costs of a two level memory hierarchy. The 
cost of a local memory access is included in the cost of an arithmetic operation while 
the cost of a remote memory access is treated separately. The models were 
parameterized to facilitate analysis under different ratios of problem to machine 
size. In addition, this parameterization allows the incorporation of changes in 
technology, such as increases in on-chip computation speed or a decrease in network 
communication latency. The models assume the processors operate in a Single 
Program Multiple Data mode of execution. 

The analysis of this paper utilizes three of the models developed. The 
characteristics of these models are similar, differing only in their treatment of 
communication costs. The different treatment of communications costs ranges 
from assigning a simple topologically-blind cost for a network communication to a 
more complex function which potentially provides more accuracy. The cost of a 
floating point computation, treated similarly in each of the models, is separated 
from the cost of a remote memory access. 


3. Communication Costs 

Accurately and simply accounting for inter-processor communication is the 
toughest challenge in the development of a useful model of parallel computation. 
The three alternative treatments presented here are based on the common 
components of network communication costs exhibited by the current generation of 
multicomputers. 

1. Fixed Start-Up Costs There is a large fixed start-up cost associated 
with any message passing, packet-based communication. To execute a 
network communication often requires a processor interrupt, complete 
with a full context switch. The message must be packaged and tagged 
with destination information and injected into the network. 

2. Variable Cost Per Node This component of communications 
cost is the time to route the message through the network to its 
destination. Cut-through, circuit switched routing, a common general 
technique, for example, imposes a per node path formation cost. These 
routing and path formation costs are actually a complex function of the 
routing algorithm, the communications pattern and network topology. 
Taken in sum, these comprise the different aspects of contention. In 
these models, this complex distance-related component is simplified. It 
is approximated as the product of a machine-dependent constant and 
the number of processor nodes along the required communications 
path. Sensitivity analysis is used to potentially understand the impact 
of different degrees of contention. 

3. Spooling Costs Per Node A third component of network 
communications costs is the cost to physically spool the message 
through the network. Experimental results suggest the spooling cost 
can be approximated by a linear function of the message length, up to a 
message size of 1000 words. In these models the spooling cost of a 
message is treated as the product of a per-word cost and the length of 
the message in four-byte words. 


4. Fixed Costs of Receipt Finally, receiving a message generates a 
set of costs on the receiving processor, analogous to those required by 
the originating processor, namely, interrupts, context switches and 
message unpacking. 


4. Three Models of Computation 

The three models of computation used in this analysis were based on the 
common architectural characteristics discussed above and differ only in their 
approximation of the components of network communications costs. The 
following descriptions assume the models use only fixed constant size messages to 
accomplish all network communication. 

The GAP Model The first model, the GAP model, is a simple, topologically 
blind model which grew out of a set of discussions with a group of researchers at 
Berkeley. So named because of the "gap" in processor utilization caused by the 
initiation of a network communication, the GAP model charges one fixed cost for 
every communication regardless of its source and destination. 

The LOG Model The second model, the LOG model, introduces a variable 
topologically-based cost to the fixed cost component. The LOG model assumes the 
processors are physically connected in a 2D mesh with an overarching multi-stage 
permutation network. To approximate the distance a message must travel, the LOG 
model uses the logarithm of the Manhattan distance (or L1 norm) between the 
sending and receiving processors on the 2D mesh. The motivation for the use of 
this function is twofold. First, it realizes the lower bound on path length between 
any two nodes in a network with a bounded branching factor. Second, it generally 
approximates the behavior of a variety of networks which realize logarithmic 
communication distances, such as butterfly and shuffle-exchange networks. Thus, 
communications cost in the LOG model, with fixed length messages, is 
approximated by the following function. 

Communications Cost = Fixed + Variable * Distance 

where Distance = Log(Manhattan Distance) 

The Single Stage Model The Single Stage model also treats the cost of a 
communication as the sum of fixed costs and a per-node distance dependent cost. 
This model, like the LOG model, also assumes the processors are physically 
connected in a two dimensional mesh. In the single stage model, however, there is 
no overarching multi-stage network. All communication is accomplished by single 
or multiple hops along the physical connections of the 2D mesh. The motivation 
for this model was the possibility of quantifying the impact of a multi-stage network 
on performance for various applications. The cost of a communication with fixed 
length messages, therefore, is approximated by the following function. 

Communications Cost = Fixed + Variable * Distance 
where Distance = Manhattan Distance 
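
The three cost functions described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, mesh positions are taken to be (row, column) pairs, and the logarithm is assumed to be base 2, which the text does not specify.

```python
import math


def manhattan(src, dst):
    """Manhattan (L1) distance between two (row, col) positions on the 2D mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def gap_cost(fixed):
    """GAP model: one fixed charge per message, blind to source and destination."""
    return fixed


def log_cost(fixed, variable, src, dst):
    """LOG model: Fixed + Variable * log(Manhattan distance).

    The base of the logarithm (2 here) is an assumption.
    """
    d = manhattan(src, dst)
    if d == 0:
        raise ValueError("source and destination must be distinct processors")
    return fixed + variable * math.log2(d)


def single_stage_cost(fixed, variable, src, dst):
    """Single Stage model: Fixed + Variable * Manhattan distance."""
    return fixed + variable * manhattan(src, dst)
```

For a message travelling 8 hops with a 2500 cycle fixed cost and a 200 cycle per-node cost, the sketch charges 2500 + 200 * 3 cycles under the LOG model and 2500 + 200 * 8 cycles under the Single Stage model, while the GAP model charges only its fixed cost.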


Model Parameters 

The three models, with fixed length messages, are parameterized by a fixed 
component and a per-node variable component of communication, the cost of a 
floating point operation, and the machine size (number of processors). In this 
analysis eight different pairs of values for fixed and variable cost per node are used. 
Five pairs are used to represent different possible conditions in the LOG and Single 
Stage models, while the last three, where the variable costs are zero, represent the 
similar conditions in the GAP model. The values in the table below were based on 
timings of random end to end communication patterns on an early release of the 
CM5 performed by both an internal Thinking Machines applications group and 
more independent sources. 


Table 1 
Model Parameters 

Fixed (cycles)   Variable (cycles)   Machine State 
2500             200                 Current 
1000             200                 Current-Low 
 500             200                 Potential 
 500             100                 Potential 
 100              50                 Ideal 
5000               0                 Current 
3600               0                 Current-Low 
1000               0                 Potential 


The first two pairs of values approximate the current fixed and variable cost 
on working machines running "off the shelf" software. The first pair (2500, 200) is 
an averaged approximation while the second pair is more idealized. With a 33 MHz 
clock, such as the current clock speed of the SPARC chip used in the CM5, for 
example, a 2500 cycle fixed cost and a 200 cycle per-node variable cost translate to 
approximately 75 and 6-7 microsecond costs, respectively. The next two pairs 
represent reductions in cost which may be possible within this generation. The fifth 
pair represents an ideal. The last three pairs attempt to replicate the three different 
states within the GAP model. The cost of a 32-bit floating point operation, in 
machine cycles, is estimated at 6 cycles. While on-chip computing speeds are rapidly 
increasing, this value attempts to approximate the current state without accelerators 
which have a "peak" rate of two operations per cycle. 
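
The cycle-to-time conversions quoted in this paper follow directly from the clock rate. The short sketch below reproduces the arithmetic; the helper name is hypothetical, and the 33 MHz default is the SPARC clock speed mentioned above.

```python
def cycles_to_microseconds(cycles, clock_mhz=33.0):
    """Convert a cost in machine cycles to microseconds.

    A clock of clock_mhz MHz executes clock_mhz cycles per microsecond.
    """
    return cycles / clock_mhz


if __name__ == "__main__":
    # Fixed and variable costs that appear in the parameter table.
    for cycles in (2500, 1000, 500, 200, 100):
        print(cycles, "cycles =", round(cycles_to_microseconds(cycles), 2), "microseconds")
```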

When the message size is allowed to vary up to approximately 1000 words, a 
fourth parameter, the per-word spooling rate, is introduced. Experimental data 
suggested a 4 cycle per-word cost would be a reasonable value, with a sensitivity 
analysis up to approximately 12 cycles per-word. 

Parallel Machine Size 


The current generation of massively parallel machines is characterized by 
"medium-grain" machines typically consisting of approximately 256 to 16K 
processors. This analysis considers machines with 2^8, 2^10, 2^12, and 2^14 processors. 

5. Multigrid Algorithms and Implementations 

The analysis presented here considers the standard V and F-cycle in two and 
three dimensions. This analysis considers only the simplest problems and solution 
schemes: model problems are considered on structured meshes spanning square and 
cubic domains. Explicit weighted Jacobi schemes are used to solve problems 
discretized using second order finite difference techniques. The hierarchy of 
structured meshes is constructed using a coarsening ratio of two in each dimension. 
The cycling schemes execute two relaxation sweeps on the downstroke and one on 
the upstroke. 
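
For readers unfamiliar with the cycling scheme, the following is a minimal serial sketch of a V-cycle with two weighted Jacobi sweeps on the downstroke and one on the upstroke. It is not the implementation analyzed in this paper: for brevity it assumes a one dimensional Poisson model problem with zero boundary values, a damping weight of 2/3, full-weighting restriction and linear interpolation, none of which are specified in the text.

```python
def jacobi(u, f, h, sweeps, omega=2.0 / 3.0):
    """Weighted Jacobi sweeps for -u'' = f with zero Dirichlet boundaries."""
    n = len(u)
    for _ in range(sweeps):
        old = u[:]
        for i in range(n):
            left = old[i - 1] if i > 0 else 0.0
            right = old[i + 1] if i < n - 1 else 0.0
            u[i] = old[i] + omega * (0.5 * (h * h * f[i] + left + right) - old[i])
    return u


def residual(u, f, h):
    """Residual f - A u for the second order finite difference operator."""
    n = len(u)
    r = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        r.append(f[i] - (2.0 * u[i] - left - right) / (h * h))
    return r


def v_cycle(u, f, h):
    """One V-cycle; len(u) must be 2**k - 1 (coarsening ratio of two)."""
    n = len(u)
    if n == 1:
        return [0.5 * h * h * f[0]]          # exact solve on the coarsest grid
    u = jacobi(u, f, h, sweeps=2)            # downstroke: two sweeps
    r = residual(u, f, h)
    nc = (n - 1) // 2
    # Full-weighting restriction of the residual to the coarse grid.
    rc = [0.25 * r[2 * j] + 0.5 * r[2 * j + 1] + 0.25 * r[2 * j + 2]
          for j in range(nc)]
    ec = v_cycle([0.0] * nc, rc, 2.0 * h)    # coarse grid correction
    # Linear interpolation of the correction back to the fine grid.
    for j in range(nc):
        u[2 * j + 1] += ec[j]
    for j in range(nc + 1):
        left = ec[j - 1] if j > 0 else 0.0
        right = ec[j] if j < nc else 0.0
        u[2 * j] += 0.5 * (left + right)
    return jacobi(u, f, h, sweeps=1)         # upstroke: one sweep


if __name__ == "__main__":
    n = 63
    h = 1.0 / (n + 1)
    f = [1.0] * n
    u = [0.0] * n
    for cycle in range(5):
        u = v_cycle(u, f, h)
        print(cycle + 1, max(abs(x) for x in residual(u, f, h)))
```

In a parallel setting, each Jacobi sweep and each residual evaluation requires an exchange of subdomain boundary values, which is the source of the fine grid communication costs discussed in the analysis.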

The problems were implemented on the parallel models using simple, 
practical domain partitioning strategies. In two dimensions the finest mesh was 
simply partitioned into load-balanced square subdomains and mapped to the 
analogous processor in the 2D mesh of processors. In three dimensions, the domain 
was analogously partitioned and the processor mapping was only slightly more 
complicated and was within a factor of two of optimal. 


6. Analysis Overview 

The remainder of this paper presents the results and implications of the 
implementation of the standard multigrid algorithms on the three models of 
parallel computation. The following two sections present the performance 
predictions for the two and three dimensional V-cycle when fixed length messages 
are used to execute all of the required network communication. Next, the results of 
the same analysis are repeated with variable length messages where the message 
size is allowed to vary up to 1000 words. The results of an implementation of the 3D 
V-cycle on the Single Stage model are then presented. Finally, the implications of 
the set of predictions are summarized. 


7. The Standard V-cycle in Two Dimensions 

The performance predictions for the two dimensional V-cycle on both the 
LOG and GAP models were not encouraging. On moderate sized machines, those 
with 1K to 4K processors, with a 2500 cycle fixed communications cost (approx. 75 
microseconds), the models predicted speed-ups of only 55 times over the serial 
implementation. For larger machines, the speed-ups do not even reach 200 times. 
The table below shows the speed-ups of the V-cycle for different machine sizes 
under different assumptions of fixed and variable communications costs. The 
problem size is 1,000,000 points or 1000 points per dimension. 

Table 2 
Speed-Up 
Two Dimensional V-cycle with Fixed Length Messages 
N^2 = 1,000,000 
GAP and LOG Model Predictions 

                       Processors 
Fixed, Variable      256      1024      4096    16,384 
2500, 200           27.1      55.1     103.2     172.6 
1000, 200           58.3     125.5     238.6     387.1 
500, 200            94.4     218.8     424.0     660.9 
500, 100            94.9     223.5     450.5     755.0 
100, 50            190.4     585.7    1462.1    2881.7 
5000, 0             19.5      39.3      74.4     128.6 
3600, 0             19.8      40.4      79.3     147.0 
1000, 0             58.7     128.6     255.4     453.2 

Because the information provided by these models attempts to bridge the gap 
between abstract models of computation and machine-dependent benchmarks, 
interpreting the data is not straightforward. From a theoretical perspective these 
speed-ups are far from linear. On the other hand computing the wall clock time 
associated with these predictions, then scaling these model problem times to reflect 
the increased complexity of actual applications, produces running times which are 
unacceptably slow. 

If the fixed cost of a communication can be reduced to 500 cycles or 15 
microseconds with a 33 MHz clock, the models predict speed-ups in the range of 200 
times. Only in the ideal case where the fixed cost of a communication is 100 cycles 
or approximately 3.3 microseconds, do the speed-ups become somewhat attractive. 

These discouraging predictions are a result of very high communications 
latencies. With a fixed problem size of 1,000,000 points, in this range of processors, 
the fine grid communications costs dominate both the cost of the computation and 
the cost of coarse grid communications. 

Very fast processors and relatively slow network communication create very 
poor processor utilization in this range of processors and problem sizes. The 
increased cost of the microprocessors in these machines makes efficiency an 
important performance criterion. We define efficiency here as the ratio of the time 
spent on computation to the total time on both computation and network 
communication. The table below shows the efficiency predicted by the models for 
the two dimensional V-cycle using the same eight pairs of values for the fixed and 
variable cost of a network communication. 

Table 3 
Efficiency 
Two Dimensional V-cycle with Fixed Length Messages 
N^2 = 1,000,000 
GAP and LOG Model Predictions 

                       Processors 
Fixed, Variable      256      1024      4096    16,384 
2500, 200          10.6%     5.39%     2.55%     1.13% 
1000, 200         22.77%    12.29%     5.91%     2.53% 
500, 200          36.88%    21.44%    10.51%     4.32% 
500, 100          37.10%    21.90%    11.17%     4.95% 
100, 50           74.41%    57.38%    36.36%    17.54% 
5000, 0            5.62%     2.80%     1.33%     0.60% 
3600, 0            7.63%     3.85%     1.84%     0.85% 
1000, 0           22.94%    12.60%     6.33%     2.97% 


Both the LOG and the GAP models predict very low efficiency levels when 
the fixed cost of a communication is high. With a fixed cost of 2500 cycles, small to 
modest sized machines, consisting of 256-1024 processors, reach only 5%-10% 
efficiency. With a fixed cost of 1000 cycles (approximately 30.3 microseconds using a 
33 MHz clock), efficiency is still only 10%-20%. Driving the fixed cost down to 500 
cycles (15 microseconds) produces more reasonable levels of 20%-30% for modestly 
sized machines. To reach 40%-60% efficiency, where the machine begins to leverage 
the power of these new microprocessors, the fixed cost needs to be reduced all the 
way down to the 100 cycle range (3.3 microseconds). 
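
The efficiency measure defined in this section is simple to state in code. The sketch below uses hypothetical function names. The second helper rests on an observation rather than a statement in the text: the tabulated efficiencies appear to equal the speed-up divided by the number of processors, to within rounding. For instance, the 256 processor speed-up of 27.1 for the (2500, 200) pair gives 27.1 / 256, or about 10.6%.

```python
def efficiency(compute_time, comm_time):
    """Ratio of time spent computing to total time (computation + communication)."""
    return compute_time / (compute_time + comm_time)


def implied_efficiency(speedup, processors):
    """Efficiency implied by a speed-up, assuming speed-up = processors * efficiency."""
    return speedup / processors


if __name__ == "__main__":
    # A processor that computes for 1 time unit and communicates for 9 is 10% efficient.
    print(efficiency(1.0, 9.0))
    print(implied_efficiency(27.1, 256))
```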


8. The Standard V-cycle in Three Dimensions 


The implementation and analysis of three dimensional problems differs from 
the two dimensional analysis in several ways. First, the additional dimension 
increases the computational burden by a factor of O(N), to O(N^3), while increasing 
the required communication by a factor of N/P^(1/6). Second, mapping the three 
dimensional problem domain to a two dimensional machine model tends to 
increase not only the complexity, but the distance of interprocessor 
communications. Third, the problem size in the analysis is increased by a factor of 
1000, while still considering the same range of machine sizes. 
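
One plausible way to arrive at the communication growth factor quoted above is sketched below. It assumes, as an illustration only, that each of the P processors holds a square subdomain in two dimensions and a cubic subdomain in three dimensions, and that the communication per processor is proportional to the size of one subdomain face. The symbols C_{2D} and C_{3D} are introduced here and do not appear in the original analysis.

```latex
\begin{align*}
C_{2D} &\propto \frac{N}{P^{1/2}}, \qquad
C_{3D} \propto \frac{N^{2}}{P^{2/3}}, \\
\frac{C_{3D}}{C_{2D}} &= \frac{N^{2}/P^{2/3}}{N/P^{1/2}} = \frac{N}{P^{1/6}} .
\end{align*}
```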

The LOG and GAP models predict only slightly improved levels of 
performance for the three dimensional V-cycle. Table 4 below lists the speed-ups 
predicted by the models for three dimensional problems with one billion points. 

Table 4 
Speed-Up 
Three Dimensional V-cycle with Fixed Length Messages 
N^3 = 1,000,000,000 

                       Processors 
Fixed, Variable      256      1024      4096    16,384 
2500, 200           49.1     130.6     338.1     859.3 
1000, 200           86.7     240.3     636.8    1632.5 
500, 200           116.2     333.7     902.7    2332.2 
500, 100           129.5     389.2    1102.2    2969.2 
100, 50            202.7     702.6    2288.7    6954.7 
5000, 0             30.1      79.2     205.3     526.7 
3600, 0             39.9     106.8     279.7     722.5 
1000, 0            102.3     302.4     855.2    2333.5 


Generally, the predictions are not encouraging. The slight increase in 
performance is due to the increased amount of computation relative to both the 
amount of communication and the number of processors. For a 1024 processor 
machine with a 2500 cycle fixed communication cost, the LOG model predicts a 
speed-up of only 130 times. If the fixed cost of a communication drops to 500 cycles, 
this improves by a factor of 2-3. Only in the ideal case of a 100 cycle fixed 
communication cost do the results approach acceptable levels for moderately-sized 
machines and design tool levels, with thousand-fold speed-ups, for very large 
machines. 

As with the two dimensional predictions, the sluggish predictions are due 
mainly to the overwhelming costs of the network communication. Table 5 below 
shows the efficiency levels which coincide with these speed-up predictions. 


Table 5 
Efficiency 
Three Dimensional V-cycle with Fixed Length Messages 
N^3 = 1,000,000,000 

                       Processors 
Fixed, Variable      256      1024      4096    16,384 
2500, 200         19.19%    12.75%     8.25%     5.26% 
1000, 200         33.85%    23.46%    15.54%     9.96% 
500, 200          45.40%    32.59%    22.03%    14.23% 
500, 100          50.58%    38.01%    26.92%    18.12% 
100, 50           79.17%    68.61%    55.87%    42.45% 
5000, 0            11.7%      7.7%     5.01%     3.22% 
3600, 0           15.59%    10.42%     6.83%     4.40% 
1000, 0           39.95%    29.53%    20.87%    12.42% 


The predictions of these models are in contrast to the asymptotic predictions 
of more abstract models of computation. Asymptotic analysis suggests the fine grid 
communications costs become negligible as the problem size gets large for a fixed 
range of machine sizes. These results suggest the huge imbalance between the cost 
of communication per word and the cost of a floating point computation causes 
communication time to dominate the time spent on computation, even with one 
billion points. 

The standard V-cycle algorithm alternates between computation and 
communication systolically, placing a heavy communications burden on a multi- 
stage interconnection network. On medium-grain multiprocessors, those with 256 
to 16K processors, for realistic problem sizes, local, fine-grid communication is 
predominant. By the time the grids have coarsened beyond one point per processor, 
only a small fraction of the computation remains. This magnifies the importance of 
a small fixed cost per word and de-emphasizes the importance of low variable per- 
node communications costs. Unfortunately, the models in the previous section 
show the demand for inexpensive local communication is answered in the current 
generation of massively parallel machines by a high fixed communications cost 
producing discouraging levels of performance for both two and three dimensional 
problems. 
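
The claim that little work remains once the grids have coarsened beyond one point per processor can be checked with a rough count. The sketch below is illustrative only: the function name is hypothetical, work per level is assumed to be proportional to the number of grid points on that level, and a grid dimension of 1024 is used in place of 1000 so that the coarsening ratio of two divides evenly.

```python
def coarse_work_fraction(n_per_dim, processors, dims):
    """Fraction of all grid points in the hierarchy that lie on levels
    holding fewer than one point per processor."""
    total = 0
    below = 0
    n = n_per_dim
    while n >= 1:
        points = n ** dims
        total += points
        if points < processors:
            below += points
        n //= 2          # coarsening ratio of two in each dimension
    return below / total


if __name__ == "__main__":
    for p in (256, 1024, 4096, 16384):
        print(p, coarse_work_fraction(1024, p, dims=2))
```

Under these assumptions, for a 1024 by 1024 grid on 1024 processors, well under 0.1% of the grid points lie on levels with fewer than one point per processor.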

9. The Standard V-cycle with Variable Length Messages 

The previous analysis assumed all communication was accomplished 
through fixed length messages consisting of only a small constant number of words. 
Sensitivity analysis suggested that acceptable levels of performance required lower 
fixed costs per word. The low spooling rate per word exhibited by these machines 
motivates potentially lowering the average communication cost per word by 
transmitting large blocks of words per message. With large messages, the fixed cost 
of initiating a network communication can be amortized over a larger number of 
words, lowering the effective fixed cost per word. 

In the analysis of this section, spooling costs are added to the communication 
cost functions of the previous section. The cost of a message is a function of the 
distance and the length in words, and is the sum of fixed start-up and receipt costs, 
variable per-node costs and spooling costs. 
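
The amortization argument can be made concrete with a short calculation. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the single `fixed` parameter is assumed to cover both start-up and receipt costs, and `distance` stands for whichever distance measure the chosen model uses (the logarithm of the Manhattan distance in the LOG model, the Manhattan distance itself in the Single Stage model). The default spooling rate of 4 cycles per word is the value used in this analysis.

```python
def message_cost(fixed, variable, distance, words, spool_per_word=4):
    """Total cost in cycles of one message: fixed + per-node + spooling costs."""
    return fixed + variable * distance + spool_per_word * words


def cost_per_word(fixed, variable, distance, words, spool_per_word=4):
    """Average cost in cycles per word transmitted."""
    return message_cost(fixed, variable, distance, words, spool_per_word) / words


if __name__ == "__main__":
    # Current machine state (2500, 200), one unit of distance.
    for words in (1, 10, 100, 1000):
        print(words, cost_per_word(2500, 200, 1, words))
```

With a 2500 cycle fixed cost, a 200 cycle per-node cost and one unit of distance, a single word message costs 2704 cycles per word, while a 1000 word message costs 6.7 cycles per word, which is comparable to the 6 cycle cost assumed for a floating point operation.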

Experimental data suggest that approximating the total spooling costs as a 
linear function of message size is reasonable up to approximately 5000 words. The 
analysis here assumes a maximum message size of 1000 words and uses a per word 
spooling cost of 4 clock cycles. Approximating the spooling rate was accomplished 
with the help of timings provided by Pablo Tomayo of Thinking Machines, Inc. The 
rate was determined by a regression analysis on three node ping pong rates of 
message sizes ranging from 1 to 5000 words. Sensitivity analysis with rates up to 12 
cycles per word showed the results of this section are relatively insensitive to small 
changes in the per-word spooling rate. 

The predictions for the standard V-cycle algorithm in two dimensions with 
large message transmission were generally far more encouraging than the fixed 
message length predictions. The table below lists the speed-up and efficiency 
predictions for the same eight pairs of fixed and variable communications costs. 

Table 6 
Speed-Up and Efficiency 
Two Dimensional V-cycle with Variable Length Messages 
N^2 = 1,000,000 

Each entry lists the speed-up followed by the efficiency. 

                                     Processors 
Fixed, Variable   256               1024              4096              16,384 
2500, 200         170.1  66.49%     314.0  30.76%      368.0   9.12%     354.5   2.32% 
1000, 200         208.3  81.41%     501.8  49.17%      709.4  17.59%     715.5   4.69% 
500, 200          225.1  87.99%     626.9  61.42%     1027.0  25.47%    1083.1   7.10% 
500, 100          338.1  89.23%     667.1  65.36%     1197.3  29.69%    1360.8   8.92% 
100, 50           446.1  96.21%     882.6  86.47%     2396.4  59.44%    3839.6   25.1% 
5000, 0           132.5  51.77%     200.8  19.67%      216.5   5.36%     207.7   1.36% 
3600, 0           152.8  59.72%     258.6  25.33%      294.2   7.29%     286.7   1.88% 
1000, 0           213.8  83.55%     555.4  54.42%      882.9   21.9%     979.7   6.42% 


These predictions show at least a factor of 6 speed-up on moderate-size 
machines and a factor of two speed-up on large machines over the fixed length 
predictions. For example, on a 1024 processor machine, with a fixed 
communications cost of 2500 cycles, with variable length messages, the speed-up 
predicted is 314 as compared to 55 on the models with constant message size. There 
is a corresponding improvement in the efficiency of 30% versus 5%. If fixed costs 
can be driven down to 500 cycles, the variable message length still provides 
approximately a factor of two improvement over the fixed length predictions. 

With large messages reducing the fine grid communication costs, the coarse 
grid communications costs, which are proportional to log2 p, grow to counterbalance 
the computational speed-ups provided by additional processors. The increase in 
speed-up as the number of processors gets large is less pronounced. In addition, the 
optimal number of processors implied by this trade-off occurs in a more reasonable 
range. For example, with fixed and variable costs of 2500 and 200 cycles respectively, 
the models predict that the optimal number of processors for this computation is 
approximately 4900. 

With the three dimensional V-cycle, the models suggest that the ability to 
send variable length messages, up to 1000 words, produces a marked increase in 
solution speed in this range of processors, on problems up to one billion points. 

The table below shows the speed-ups predicted for the three dimensional algorithm 
by the LOG and GAP models. 


Table 7 
Speed-Up 
Three Dimensional V-cycle with Variable Length Messages 
N^3 = 1,000,000,000 

                          Processors 
Fixed, Variable      256        1024          4096        16,384 
2500, 200          253.3      1006.1        3965.4     [illegible] 
1000, 200          253.9      1010.3        3999.2      15,555.7 
500, 200           254.1   [illegible]      4010.6      15,680.8 
500, 100           254.2      1012.3        4016.9      15,773.9 
100, 50            254.4      1013.7        4029.3      15,924.2 
5000, 0            252.5      1000.3        3922.4      14,785.1 
3600, 0            253.0      1004.2        3953.3      15,105.8 
1000, 0            254.1      1011.5        4011.8      15,739.9 


With variable length messages, the high fixed communications cost can be 
effectively amortized over a large number of words, driving down the average cost 
per word to a more ideal range. Computation costs dominate the total execution 
time, producing almost linear speed-ups in this range of problem to processor size. 
Almost all of the complementary efficiency levels are above 90% for each of the 
eight fixed, variable communications cost pairs throughout the entire range of 
machine sizes. 

These results suggest the average communications cost per word can be 
driven down far enough through the efficient transmission of large messages to 
effectively leverage the increased computational speeds of the current generation of 
microprocessors. Thus, the ability to package messages into large blocks, up to a 1000 
word maximum, can potentially bring these machines closer to the goal of design 
tool performance on these problems. 

10. The F-Cycle 

Performance predictions for the standard F-cycle were very similar to the V- 
cycle results. With both fixed, constant length and variable length message 
transmission, the F-cycle slightly outperformed the V-cycle. With fixed message 
lengths, this was due mainly to the reduced amount of fine grid communication of 
the F-cycle. With the ability to send large messages, the F-cycle outperformed the 
V-cycle in three dimensions because of the reduction in the amount of required 
computation. 

11. Standard V-cycle on a Single Stage Machine 

The two and three dimensional V-cycle algorithms were implemented on the 
Single Stage model in order to try to determine the impact of a multi-stage network 
on the performance of multigrid algorithms. The single stage model assumes the 
processors are connected by a 2D mesh and all communication takes place along 
these physical connections. There is no overarching multi-stage communication 
network. The model is parameterized by the same machine dependent costs, 
namely, fixed and variable communications costs, spooling rates and floating point 
computation rates. The only difference in communications costs is in the variable, 
distance related cost component. In this model the distance a message must travel is 
simply the Manhattan distance (the L1 norm) between the locations of the sending and 
receiving processors on the mesh. 

The results in both two and three dimensions suggest the impact of a multi- 
stage network on performance is very small, regardless of the maximum message 
length. The table below shows the increase in total time caused by sending messages 
through the mesh connections rather than through the logarithmic multi-stage 
network defined by the LOG model. 


Table 8 
The Percentage Increase in Total Time for the 3D V-cycle 
Implemented on the Single Stage Model 
from the Time Required On the Multi Stage LOG Model 

Number of Processors   % Difference M=1   % Difference M=1000 
256                     5.88%             0.05% 
1024                    8.17%             0.19% 
4096                   11.54%             1.19% 
16,384                 16.47%             8.79% 


The table shows a less than 10% increase on moderate sized machines with 
fixed message length communication, where the fixed and variable costs of a 
communication are 2500 and 200 cycles respectively. For machines with variable 
length message capability, the increase in total time is less than 1% for moderate 
machines. 

In three dimensions, the increase in communications costs alone with 
variable length messages is small, except on very large machines. The table below 
isolates the communications costs and shows the percentage increases. 

Table 9 
The Percentage Increase in Communications Time for the 3D V-cycle 
Implemented on the Single Stage LPSS Model 
from the Time Required On the Multi Stage LOG Model 

Number of Processors   % Difference M=1000 
256                      4.38% 
1024                    10.87% 
4096                    37.30% 
16,384                 120.91% 


The table shows the small increase in communications costs with only a single stage 
permutation network. For very large machines the increase is only slightly over a 
factor of two. These results suggest that even for very large machines with fixed 
length messages, the addition of multi-stage networks does not seem to enhance 
performance enough to justify the additional machine complexity. 
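
The "factor of two" statement follows directly from the percentages in Table 9, since an increase of p percent corresponds to a ratio of 1 + p/100. The short sketch below makes that arithmetic explicit; the variable and function names are illustrative only.

```python
def slowdown_factor(percent_increase):
    """Convert a percentage increase into a multiplicative factor."""
    return 1.0 + percent_increase / 100.0


# Percentage increases in communications time, copied from Table 9 (M=1000).
TABLE_9 = {256: 4.38, 1024: 10.87, 4096: 37.30, 16384: 120.91}

factors = {procs: slowdown_factor(pct) for procs, pct in TABLE_9.items()}
```

For the 16,384-processor machine this gives a ratio of about 2.21 between the two models.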

12. Conclusions 

The performance predictions presented here suggest the fixed cost of a 
communication on the current generation of massively parallel machines needs to 
be driven down into the range of 15 microseconds to produce acceptable levels of 
performance. Ideally, the cost should be in the range of 3 microseconds. The 
computational speeds of the next generation of microprocessors appear to be 
increasing rapidly. Though these and other hardware advances may produce 
enhanced performance, they will certainly exacerbate the huge disparity between the 
speeds of on-chip and network events. Driving down the average cost of a local 
communication appears to be imperative if these machines are to become efficient 
platforms for the solution of multigrid applications. 

One way to accomplish this reduction in the average cost per word of a 
network communication may be through the efficient transmission of large 
messages. This capability would allow the fixed cost of a communication to be 
amortized over a large number of words. 
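
The amortization argument can be illustrated with a simple cost-per-word calculation. The linear model below (a fixed cost plus a constant cost per word) and the per-word cost used in the example are assumptions for illustration; only the 2500-cycle fixed cost is taken from the text above.

```python
def cost_per_word(words, fixed_cost, per_word_cost):
    """Average cost per word of one message under an assumed linear model."""
    if words < 1:
        raise ValueError("a message must contain at least one word")
    return fixed_cost / words + per_word_cost


# Fixed cost of 2500 cycles (from the text); per-word cost of 10 cycles is hypothetical.
single_word = cost_per_word(1, 2500.0, 10.0)
long_message = cost_per_word(1000, 2500.0, 10.0)
```

Under these assumptions the average cost per word falls from 2510 cycles for a one-word message to 12.5 cycles for a 1000-word message.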

Finally, expensive multi-stage networks appear to have little impact on the 
performance of standard multigrid algorithms. In this range of problem to machine 
sizes, with both fixed and variable length message transmission, performance 
degrades only slightly when communication is forced to traverse the physical 
connections of a 2D mesh of processors. 

SUMMARY 


A numerical scheme to solve the unsteady Navier-Stokes equations is described. The scheme is 
implemented by modifying the multigrid-multiblock version of the steady Navier-Stokes equations solver, 
TLNS3D. The scheme is fully implicit in time and uses TLNS3D to iteratively invert the equations at each 
physical time step. The design objective of the scheme is unconditional stability (at least for first- and 
second-order discretizations of the physical time derivatives). With unconditional stability, the choice of 
the time step is based on the physical phenomena to be resolved rather than limited by numerical stability, 
which is especially important for high Reynolds number viscous flows, where the spatial variation of grid 
cell size can be as much as six orders of magnitude. 

An analysis of the iterative procedure and the implementation of this procedure in TLNS3D are 
discussed. Numerical results are presented to show both the capabilities of the scheme and its speedup 
relative to the use of global minimum time stepping. Reductions in computational times of an order of 
magnitude are demonstrated. 


INTRODUCTION 


Although significant progress has been made in the last twenty years to numerically model many 
physical situations, most numerical schemes are limited to the prediction of steady flows. This limitation 
is particularly true in the field of computational fluid dynamics (CFD), where solutions to the Navier-Stokes 
equations for steady flows are now calculated on a regular basis. (See, for example, references [1-3].) 
An important factor that has led to the increased use of Navier-Stokes solvers is the recent success in 
reducing the computer resources necessary to obtain converged solutions. Perhaps the most promising 
work has been in the use of multigrid acceleration techniques. Convergence to steady state has been 
shown in O[log(n)] work, where n represents the number of unknowns to be solved. This reduction in 
computer requirements has made steady-state solutions affordable to the practicing engineer. 

However, many physical phenomena (e.g., separated flows, wake flows, buffet) are intrinsically 
unsteady. The solution of unsteady problems in CFD has been limited to simplified subsets of the 
Navier-Stokes equations (panel methods, potential-flow solvers, and some limited use of Euler equation 
solvers). Unsteady Navier-Stokes calculations have been too expensive for routine use. 

The present approach is to apply an iterative procedure for the solution of an implicit equation; thus, 
the approach is called an iterative-implicit method. The concept is not new; in fact, many of the methods 
developed in the field of linear algebra for inverting large matrices are iterative. Within the field of 
CFD, similar work is discussed by Jameson [4] for unsteady flows and by Taylor, Ng, and Walters [5] 
for steady-state flows. The present approach is similar to that of Jameson in that a Runge-Kutta based 
multigrid method is used to solve the implicit unsteady flow equations. The Navier-Stokes equations have 
been treated in the present work, and Jameson’s implementation has been modified so that the robustness 
of the scheme is dramatically increased. 

A detailed description of the implementation will be followed by an analysis of the method and the 
numerical results from one- and two-dimensional test problems. 


TIME-DEPENDENT METHOD 


In the present work, a modified version of the thin-layer Navier-Stokes (TLNS) equations is used to 
model the flow. The acronym “TLNS” used here describes an equation set obtained from the complete 
Reynolds-averaged Navier-Stokes equations by retaining only the viscous diffusion terms normal to the 
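solid surfaces. The effects of turbulence are modeled through an eddy-viscosity hypothesis. The 
Baldwin-Lomax turbulence model [6] is used for turbulence closure. For a body-fitted coordinate system 
$(\xi, \eta, \zeta)$ fixed in time, these equations can be written in the conservation-law form as 

$$\frac{\partial}{\partial T}\left(J^{-1} U\right) = -\frac{\partial F}{\partial \xi} - \frac{\partial G}{\partial \eta} - \frac{\partial H}{\partial \zeta} + \frac{\partial F_v}{\partial \xi} + \frac{\partial G_v}{\partial \eta} + \frac{\partial H_v}{\partial \zeta} \tag{1}$$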


where U represents the conserved variable vector and F, G, and H represent the convective flux vectors. In 
the above equation set, $F_v$, $G_v$, and $H_v$ represent the viscous flux vectors in the three coordinate directions 
$(\xi, \eta, \zeta)$, and J is the Jacobian of the transformation. These equations represent a more general form of the 
classical thin-layer equations introduced in reference [6] because the diffusion terms in all three coordinate 
directions are included in this form. The Euler equations can easily be recovered from equation (1) by 
simply dropping the last three terms on the right-hand side. 

The temporal derivatives are cast as a fully implicit operator in physical time. For first- or second-order 
discretizations in time, this produces an unconditionally stable scheme, which allows the time-step size 
to be chosen based on the temporal resolution needed in the solution rather than limited by the numerical 
stability requirements. The fully implicit terms are iteratively solved with multigrid acceleration rather than 
direct inversion, which would be too costly for the nonlinear three-dimensional Navier-Stokes equations. 


IMPLEMENTATION OF TIME-DEPENDENT METHOD 


Original TLNS3D Method 


In the original TLNS3D program, a semidiscrete cell-centered finite-volume algorithm, based on a 
Runge-Kutta time-stepping scheme [1][7][8], is used to obtain the steady-state solutions to the TLNS 
equations. A linear fourth-difference-based and nonlinear second-difference-based artificial dissipation is 
added to suppress both the odd-even decoupling and the oscillations in the vicinity of shock waves and 
stagnation points, respectively. Both the scalar and matrix forms of the artificial dissipation models [9] 
are incorporated. 

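In the steady-state implementation, the physical time T is replaced by a pseudo time t, which gives 

$$\frac{\partial}{\partial t}\left(J^{-1} U\right) = -\frac{\partial F}{\partial \xi} - \frac{\partial G}{\partial \eta} - \frac{\partial H}{\partial \zeta} + \frac{\partial F_v}{\partial \xi} + \frac{\partial G_v}{\partial \eta} + \frac{\partial H_v}{\partial \zeta} \tag{2}$$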

At steady state, the left-hand side of equation (2) disappears, and the right-hand side (the residual) goes 
to zero, so that any stable scheme may be used to advance the solution in pseudo time. 

In the original TLNS3D program, the solution is advanced with a five-stage Runge-Kutta time-stepping 
scheme. Three evaluations of the artificial dissipation terms (computed at the odd stages) are used to obtain 
a larger parabolic stability bound, which allows a higher CFL number in the presence of physical viscous 
diffusion terms. Such a scheme is computationally efficient for solving both the steady Navier-Stokes 
and the steady Euler equations. The stability range of the numerical scheme is further increased with 
the use of the implicit residual smoothing technique that employs grid aspect-ratio-dependent coefficients 
[1][10][11]. 

Equation (2) can be rewritten to group the convective and diffusive terms from the right-hand side as 

$$\mathrm{Vol}\,\frac{\partial U}{\partial t} + C(U) - D_p(U) - D_a(U) = 0 \tag{3}$$

or 

where the equation has been multiplied through by the volume Vol and $C(U)$, $D_p(U)$, and $D_a(U)$ are 
the convection, physical diffusion, and artificial diffusion terms, respectively. The implementation of the 
Runge-Kutta time stepping is shown by rewriting equation (3) as 

Vol Ut ~ U ° + - D°JU) - Df'(U) = 0 (4) 

a* At 

where the superscript k indicates that the given term should be evaluated at the kth Runge-Kutta stage. 
The superscript indicates that the terms are evaluated with a linear combination of the values from 

previous Runge-Kutta stages. 

The solution is advanced in pseudo time with the maximum allowable time step for each cell. Enthalpy 
damping, which has been used in several previous studies to accelerate the convergence of the numerical 
scheme, is not employed because the Navier-Stokes equations generally do not admit constant enthalpy as 
a solution. The efficiency of the steady numerical scheme is also significantly enhanced through the use 
of a multigrid acceleration technique as described in reference [1]. The original TLNS3D program was 
extensively modified to facilitate solution of the flow fields over a wide range of geometric configurations 
through domain decomposition. A consequence of this work is the generalization of the boundary 
conditions of the program to easily accommodate any arbitrary grid topology. A detailed description 
of this capability is given in reference [12]. 


Time-dependent TLNS3D-MB 


The physical time derivative of equation (1) is approximated by a discrete operator of the form 

$$L_t U^{n+1} = \frac{1}{\Delta T}\left(\sum_{m=0}^{M} a_m U^{n+1-m}\right) = \frac{1}{\Delta T}\left[a_0 U^{n+1} + E\left(U^{n}, U^{n-1}, \ldots\right)\right] \tag{5}$$

to give 

$$L_t U^{n+1} = S\left(U^{n+1}\right) \tag{6}$$

where E denotes the portion of the discrete physical time derivative that involves values from the previous 
time steps and $S(U^{n+1})$ denotes the discrete approximation to the right-hand side of equation (1). 
Equation (6) is an implicit time-accurate equation for the time advancement of the unsteady solution 
of the Navier-Stokes equations. The first task is to put equation (6) in a form that is amenable to time- 
asymptotic steady-state methods such as TLNS3D. This involves construction of an iteration procedure 
that can be interpreted as a pseudo time. The TLNS3D code employs a Runge-Kutta and multigrid 
methodology to advance the pseudo time, which introduces an additional level of iteration. To avoid the 
use of multiple indices and, hopefully, avoid confusion between the various iteration procedures within 
the overall algorithm, the following change of variables is introduced. Let $W = U^{n+1}$ and $W^l$ denote 
the $l$th approximation to $W$. Thus, $\lim_{l \to \infty} W^l$ is equal to $U^{n+1}$ whenever the iterative method used for the 
solution of equation (6) is convergent. In describing the Runge-Kutta scheme, let $V^k$ denote the solution 
obtained in the $k$th stage of the Runge-Kutta scheme. 
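
As an illustration of the discrete time operator in equation (5), the sketch below evaluates it with the standard first- and second-order backward-difference coefficients. These coefficient values are an assumption: the text states only that first- and second-order discretizations are used, not the coefficients themselves. The function names are likewise hypothetical.

```python
# Standard backward-difference coefficients a_m (assumed, not taken from the paper).
BDF_COEFFS = {
    1: [1.0, -1.0],
    2: [1.5, -2.0, 0.5],
}


def discrete_time_derivative(history, dT, order):
    """Evaluate (1/dT) * sum_m a_m U^{n+1-m}; history[0] is U^{n+1}, history[1] is U^n, ..."""
    coeffs = BDF_COEFFS[order]
    if len(history) < len(coeffs):
        raise ValueError("not enough time levels for this order")
    return sum(a * u for a, u in zip(coeffs, history)) / dT


def split_operator(history, order):
    """Return (a_0, E), where E collects the contributions of previous time levels."""
    coeffs = BDF_COEFFS[order]
    a0 = coeffs[0]
    E = sum(a * u for a, u in zip(coeffs[1:], history[1:]))
    return a0, E
```

With U(T) = T^2 sampled at T = 2, 1, 0 and a unit time step, the second-order operator returns the exact derivative 4 at T = 2. The need for values at several earlier time levels is also why higher order schemes have no complete history available during their first few steps.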

Equation (6) is rewritten in this notation to obtain 

$$\frac{a_0}{\Delta T} W + \frac{E}{\Delta T} = S(W) \tag{7}$$


where E again involves the portion of the physical time derivative at previous time steps and is invariant 
during the iteration process which advances the solution from $T^n$ to $T^{n+1}$. An iterative equation is 
constructed from equation (6) simply by adding a pseudo-time derivative term to the left-hand side. 
The only consideration is that the sign of the new term must be the same as that of the physical time 
derivative. 

$$W_t + \frac{a_0}{\Delta T} W + \frac{E}{\Delta T} = S(W) \tag{8}$$

When $\lambda \equiv a_0 \Delta t / \Delta T$ is small, equation (8) differs from equation (2) by only a small perturbation. 
Thus, we can reasonably expect that any method that efficiently solves equation (2) would also solve 
equation (8). 

In the previous work of Jameson [4], all contributions from the physical time term were carried to 
the right-hand side of the equation and treated as explicit terms within the Runge-Kutta stages. With 
efficiency as a consideration, Jameson suggested the use of this approach only when the CFL, based on 
the physical time step, is greater than 200. (Note that a large CFL corresponds to a small $\lambda$.) A later 
section shows that this explicit approach is actually unstable for large values of $\lambda$. 

In many flows, especially high Reynolds number viscous flows, $\lambda$ may easily become large. In the 
present work, the Runge-Kutta scheme is made stable for all values of $\lambda$ by treating the contribution 
of the physical time derivative implicitly within the Runge-Kutta scheme. Because the time derivative 
appears only as a diagonal term, the modification is easily implemented as follows. 

$$V^0 = W^l$$

$$(1 + \alpha_k \lambda)\, V^k = V^0 + \alpha_k\, \Delta t\, R^{k-1}(V), \qquad k = 1, 2, 3, \ldots, K \tag{9}$$

$$W^{l+1} = V^K$$

where $R(V) = S(V) - E/\Delta T$ denotes the modified residual, and the superscript $k - 1$ denotes that the 
residual may be a combination of $R(V)$ at all the previous Runge-Kutta stages. 
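
The effect of the implicit treatment in equation (9) can be seen on a scalar test problem. The sketch below is a simplification and not the TLNS3D scheme: the residual is taken as linear, Δt R(V) = z V, each stage uses only the previous stage (no blending of dissipative terms), and residual smoothing and multigrid are omitted. The stage coefficients are the five-stage values quoted later in the stability analysis; the function name and arguments are hypothetical.

```python
# Five-stage Runge-Kutta coefficients quoted in the stability analysis.
ALPHAS = (1.0 / 4.0, 1.0 / 6.0, 3.0 / 8.0, 1.0 / 2.0, 1.0)


def amplification(z, lam, implicit=True):
    """Amplification factor of one Runge-Kutta sweep for a scalar model problem.

    z   : pseudo-time step times the eigenvalue of the spatial operator
    lam : lambda = a_0 * dt / dT, the strength of the physical time term
    """
    v0 = 1.0 + 0.0j
    v = v0
    for alpha in ALPHAS:
        if implicit:
            # Physical time term kept on the left-hand side, as in equation (9).
            v = (v0 + alpha * z * v) / (1.0 + alpha * lam)
        else:
            # Physical time term moved to the right-hand side (explicit treatment).
            v = v0 + alpha * (z - lam) * v
    return v
```

With z = 0 the implicit form reduces to 1/(1 + λ) for any λ, whereas the explicit form grows without bound as λ increases; for λ = 10 its magnitude already exceeds 600.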

This formulation is not yet appropriate for steady-state flow solvers such as TLNS3D. The reason is 
that these codes use several acceleration techniques, such as implicit-residual smoothing and multigrid, 
both of which are designed to operate on a residual term that goes to 0 as the solution converges. Because 
$R$ contains only the portion of the physical time derivative at previous physical time steps, it converges 
to $\lambda W$ as $W^l$ goes to $W$. To accommodate the above acceleration techniques, $R$ is rewritten as 

$$\Delta t\, R(V) = \lambda V - \lambda V + \Delta t\, R(V) = \lambda V + \Delta t\, \mathcal{R}(V) \tag{10}$$

where 

$$\mathcal{R}(V) = S(V) - (a_0 V + E)/\Delta T \tag{11}$$

The residual $\mathcal{R}$ contains all the physical terms and goes to 0 as $W^l$ goes to $W$. The Runge-Kutta method, 
with implicit-residual smoothing and multigrid, becomes 

$$V^0 = W^l \tag{12a}$$

$$(1 + \alpha_k \lambda)\, V^k = V^0 + \alpha_k \lambda V^{k-1} + \alpha_k\, \Delta t\, L_{irs}^{-1} \cdot \left[\mathcal{R}^{k-1}(V) + f\right], \qquad k = 1, 2, 3, \ldots, K \tag{12b}$$

$$W^{l+1} = V^K \tag{12c}$$

where $L_{irs}$ denotes the implicit-residual smoothing operator, and $f$ denotes the multigrid forcing function 
(which is zero on the finest grid). 

The usual coarse-grid equation that would result from applying multigrid to equation (8) is 

$$W_t^{(2h)} = R^{(2h)}(W) + \left[ I_h^{2h} R^{(h)}\left(W^{(h)}\right) - R^{(2h)}\left(I_h^{2h} W^{(h)}\right) \right] \tag{13}$$

where the superscript in parentheses denotes the multigrid grid level at which the operator or variable is 
defined ($h$ on the fine grid, $2h$ on the next coarse grid, etc.), and the operator $I_h^{2h}$ denotes the restriction 
process from the fine grid ($h$) to the coarse grid ($2h$). Because E is invariant during the multigrid cycle, 
its influence in the operator $R^{(2h)}$ cancels on the coarse grid. Consequently, the coarse-grid residual 
can be rewritten to exclude E, which eliminates the need to restrict and store this term. The actual 
coarse-grid equation is now 


wi 2h) = R ( 2 h) (W) + l[yp(^ (4) ) - R {2h) (l[y = R { 2 h \W) + / (2,i ) ( 14 ) 


where 

5 (m) = f m) _ ( m ) == h 

\S(W)-XW (m) = 2h, 4h, ... 


(15) 
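
The role of the coarse-grid forcing function can be sketched on a one-dimensional model. Everything below is an illustrative assumption: the residual is a simple second difference rather than the Navier-Stokes residual, restriction is by injection, and the function names are hypothetical. The sketch shows only the general full-approximation-scheme construction, in which the forcing term is the restricted fine-grid residual minus the coarse-grid residual of the restricted solution.

```python
def residual(w, h):
    """Hypothetical model residual: second difference at interior points, zero at the ends."""
    r = [0.0] * len(w)
    for i in range(1, len(w) - 1):
        r[i] = (w[i - 1] - 2.0 * w[i] + w[i + 1]) / (h * h)
    return r


def restrict(fine):
    """Restriction by injection: keep every other grid point."""
    return fine[::2]


def coarse_forcing(w_fine, h):
    """Forcing term: restricted fine residual minus coarse residual of restricted solution."""
    restricted_residual = restrict(residual(w_fine, h))
    coarse_residual = residual(restrict(w_fine), 2.0 * h)
    return [a - b for a, b in zip(restricted_residual, coarse_residual)]
```

By construction, the coarse-grid residual plus the forcing term equals the restricted fine-grid residual at the start of the coarse-grid iteration, so a converged fine-grid solution produces no coarse-grid correction.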


STABILITY ANALYSIS 


Two stability issues are associated with the iterative-implicit method. The first issue is the stability 
of the implicit equation that contains all the physics (equation (6)). The second issue is the stability and 
convergence of the iterative algorithm that is used to invert the implicit equation. The second issue can 
easily be studied independently of the first. The first issue can be studied independently of the second 
by assuming that the implicit equation can be solved exactly (i.e., the iterative procedure is convergent). 
The following analysis concentrates on the stability of the Runge-Kutta/multigrid method that is used to 
solve the implicit equation; however, the section is concluded with a few comments that pertain to the 
stability of equation (6). 

Fourier stability analysis is used to illustrate the effect of treating the physical time derivative implicitly 
instead of explicitly, as well as to illustrate other algorithmic choices. The analysis is performed for the 
Runge-Kutta method given by equation (12) with a scalar model equation of the form 


dw 

dr 


+ \W = a 



^4 / * \3 

32 (AX) 


dHV 

dx 4 


) 


= S c + Sd 


(16) 


The fourth derivative and its scaling closely model the numerical dissipation common to codes such 
as TLNS3D. Note that because the terms $E$ and $f$ are constant during the Runge-Kutta integration, 
they have no influence on the stability and have been dropped from the analysis. The particular 
version of Runge-Kutta used by TLNS3D and for this analysis is a five-stage method defined by 
$\{\alpha_k\} = \{1/4, 1/6, 3/8, 1/2, 1\}$. The convection terms are commonly treated differently from the dissipation 
with regard to the definition of the $k - 1$ index. In addition, the $V^{k-1}$ terms, which appear twice on the 
right-hand side of equation (12b), may be treated differently in each instance. In this stability analysis, 
the convective and dissipative terms are treated exactly as TLNS3D treats them in a steady-state case. 
These algorithmic choices are denoted by rewriting (12b) as 

(l+7 a k \)V k = V° + 'ya^\V k ~ 1 + a k ^rLj r \ • fs*" 1 + Sj -1 - 

sp = AfcSj- 1 - (1 - fik)Sp (17) 

5 ,° = 0 

where $\{\beta_k\} = \{1.0, 0.0, 0.56, 0.0, 0.44\}$. The new parameter $\gamma$ allows both implicit ($\gamma = 1$) and explicit 
($\gamma = 0$) treatments of the physical time derivative to be studied. The choice of the two $V^{k-1}$ terms is the 
subject of this analysis; however, the analysis will focus on forms that are easily implemented within the 
current version of TLNS3D. To this end, both $V^{k-1}$ terms are defined in a manner similar to the 
dissipative terms, each with its own set of stage coefficients. 

An evaluation of the spatial derivative with central-difference operators and the transformation to 
Fourier space gives 


G* 


1 + ja k X G k ~ l 


[j'Asin(^)G* : 1 — sin 4 ($/2)G l j ) 1 — A G£ 1 
+ 1 +C 2 sin 2 (fl/2) 

(l + 7a fc A) 


(k = 1,2, 3, 4, 5) (18) 


where 


G k f l =t3 k G k ~ l - (1 -(3 k )G k ~ 2 
G k - l =a k G k ~ l - (1 -a k )G k ~ 2 
Gj- 1 = a k G k ~ l - (1 — a k )G k ~ 2 
G° = Gg = G° a = G° = 1 


(19) 


GI 1 = G - 1 = G; 1 = 0 


T p 

£2 is the coefficient of implicit-residual smoothing, and A = oAr/Ax is the CFL number based on the 
pseudo-time step. 

The influence of implicit versus explicit treatment of the physical time derivative is best illustrated 
by considering a simplified case of equation (18). Consider the case in which £2 = 0 and cx k = < 7 k = (3 k , 
for which equation (18) becomes 


G k = 


1 + a fc jiA sin ( 6 )G k 1 - [A sin 4 (0/2) + A (1 - 7)]G^ fc 1 j 

(l + ia k \) 


(k = 1, 2, 3, 4, 5) (20) 


Equation (20) clearly shows that an explicit treatment of the physical time derivative ($\gamma = 0$) simply 
translates the stability region to the right as $\lambda$ increases from 0; an implicit treatment reduces the 
amplification factor, which expands the stability region. This difference is illustrated in figures 1a-c, 
which show equally spaced contours of the amplification factor ||G^|| as a function of the real (dissipative) 
and imaginary (convective) parts of the spatial operator Z{9) = A [z sin (0) — ^£4 sin 4 (9/ 2)] . Values of the 
contour lines are indicated by line types in the figure legend. Figure 1a shows the steady-state 
case $\lambda = 0$ as a point of reference. Figures 1b and 1c show the explicit and implicit cases, respectively, 
for $\lambda = 1$. Each figure also shows the locus of the spatial operator for a CFL number of 3. Other choices for 
the residual-smoothing coefficient and for the two sets of coefficients that define the $V^{k-1}$ terms 
give qualitatively similar results. An explicit treatment of the physical time derivative will always 
become unstable for sufficiently large values of $\lambda$; the implicit approach is stable for all values of $\lambda$. 

Plotting the amplification along the locus (figure 2) reveals an unusual property of equation (16). 
For $\lambda = 0$, the amplification goes to 1 as $\theta$ goes to 0, which is often considered a consistency 
condition. However, for $\lambda \neq 0$, the solution is damped across the entire spectrum. This property is 
a consequence of the source term $\lambda W$, which appears in both equations (8) and (16) and is not caused 
by an inconsistency in the derivative operators. The acceleration technique known as enthalpy damping 
makes use of the same property to improve the convergence of inviscid flows. This analysis suggests 
that an implicit treatment of enthalpy damping may lead to further improvements; however, verification 
of this possibility is beyond the scope of this work. 

Alternative choices for a k and a k are considered with the implicit-residual smoothing reinstated with a 
coefficient of 0.25. The parameters are tuned to obtain good high-frequency damping, but also to obtain a 
scheme for which ||G 5 ||((?) > ||G' 5 j|(26 l ) whenever possible. The latter criterion will reduce the influence 
that aliasing error may have on the coarse-grid equation. Other obvious choices for a k and a k (in addition 
to a k = <7 k - Pk used above) are (1.0, 0.0, 0.0, 0.0, 0.0]^= 50 and {1.0, 1.0, 1.0, 1.0, 1.0} = 51. The 

first choice corresponds to the case in which V k ~ l or V k ~ l = U°; the second corresponds to the case in 

which V k ~ l or V k ~ l = V k ~ l . Figure 3 shows the amplification for several implicit schemes, where, for 
example, {51,/?*} denotes that {a*} = 51 and {<J k } — {Ac}- Although the case with {51, /?*} has the 
lowest overall amplification, the case with {51, 51} is more monotone in 9 and, as such, is the preferred 
method. Figure 4 shows the amplification of the latter case for a range of A. 

So far, the discussion on stability has been limited to the behavior of the Runge-Kutta scheme used to solve 
equation (6), which does not imply that equation (6) is stable. Equation (6) falls into the class of multistep 
schemes for which the usual notion of absolute stability is not sufficient to ensure convergence. Instead, 
the scheme must satisfy the more stringent conditions of relative stability [13] to ensure convergence to 
the proper solution. Equation (6) can be shown to be unconditionally stable when the time operator is 
approximated to either first or second order. Similarly, time operators of the form given in equation (6) for 
which M is also the order of the operator are unconditionally unstable for M > 5. Although conditionally 
stable methods would not normally be considered appropriate for large AT calculations, the nature of the 
instability is such that even the conditionally stable methods are useful in many situations. A detailed 
study is beyond the scope of this work; however, the interested reader is referred to the cited reference. 


NUMERICAL RESULTS 


To demonstrate the capability of the present method, the results of several numerical experiments are 
given. The first case that is examined is the solution of an impulsively accelerated flat plate, which is 
also known as Stokes first problem [14]. An analytic solution for incompressible flow is available for 
this problem, which allows comparison with solutions obtained numerically with global minimum time 
stepping (GMTS) and the present method (with variations in time-step size and in the temporal order of 
the numerical discretization). 

The analytic solution of Stokes first problem [14] shows that the time-dependent solution collapses 
to a single solution of nondimensional velocity versus the similarity parameter $\eta$ defined as 

$$\eta = \frac{y}{2\sqrt{\nu T}} \tag{21}$$

where y is the direction normal to the flat plate, $\nu$ is the kinematic viscosity, and T is the physical time. 
Figure 5a shows the analytic solution plotted in the similarity parameters. A solution calculated with 
GMTS after 2000 time steps is also plotted. Calculations were also performed with the present method 
with 1, 5, and 10 time steps to reach the same physical time as the GMTS solution. Different orders 
of accuracy of the discretization of the physical time derivative were used, and the solutions are plotted 
in figures 5b-f. 
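
For reference, the similarity solution can be evaluated directly. The sketch below assumes the standard textbook form for the incompressible problem, in which the similarity parameter is y divided by twice the square root of νT and the velocity ratio u/U equals the complementary error function of that parameter. Neither expression is spelled out in the text, so both should be read as assumptions; the function names are hypothetical.

```python
import math


def similarity_eta(y, nu, T):
    """Similarity parameter, assuming eta = y / (2 * sqrt(nu * T))."""
    return y / (2.0 * math.sqrt(nu * T))


def stokes_velocity_ratio(y, nu, T):
    """Velocity ratio u/U for an impulsively started plate, assuming u/U = erfc(eta)."""
    return math.erfc(similarity_eta(y, nu, T))
```

Profiles at different times collapse onto a single curve in this variable: the points (y = 2, T = 1) and (y = 4, T = 4) with ν = 1 both give a similarity parameter of 1 and the same velocity ratio.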

Figure 5b shows the comparison of a first-order solution for the various time steps with the GMTS 
solution. The single time step does not accurately match the GMTS. As more time steps are used (smaller 
physical time steps), the comparison improves, although all show some error from $\eta = 1.5$ to $\eta = 2.0$. 

In figure 5c, comparisons of the present method with second-order physical time discretizations for 
the three time-step sizes are made with the GMTS. As expected, the smaller the time step, the better 
the agreement. 

Comparisons for third-, fourth-, and fifth-order discretizations are shown in figures 5d-f, respectively. 
For the third-order solution, good agreement occurs for the two smaller time steps; however, this agreement 
degrades for the fourth- and fifth-order solutions. This degradation can be attributed to starting errors 
caused by the unavailability of information at previous time steps for the first few steps of the higher 
order schemes. This study demonstrates that third-order discretization agrees with the GMTS solution 
better than second-order discretization. 

The work units required to obtain the solution of Stokes first problem at the same physical time as 
the GMTS solution after 2000 steps are shown in table 1 for a first- through a fifth-order physical time 
discretization. The single-step solutions with the present method required the fewest 
work units; however, the accuracy of these solutions is unacceptably poor. For first- and second-order 
time discretizations, the ten-step solutions were obtained in fewer work units than the five-step solutions 
because of better convergence. (All work units in the table are based on converging the viscous drag at 
each time step to six significant digits.) Third- through fifth-order solutions required fewer work units 
for the five-step calculations than for the ten-step calculations. These results suggest that a balance exists 
between convergence speed at a given time step and the number of time steps used to obtain a solution 
with the present method. If the time step is too large, then the accuracy is poor. If the time step is too 
small, then unnecessary work is expended to converge at unneeded time steps. 

Table 1. Work units required to calculate solution for Stokes first problem. 

Order of Physical Time    Number of Physical Time Steps 
Discretization                 1          5         10 
1st                         54.6      210.9      193.8 
2nd                         51.2      182.4      171.0 
3rd                         42.1      159.5      171.0 
4th                         15.8      165.3      193.8 
5th                         15.8      153.8      171.0 
GMTS                      2000.0 

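
The reduction in work relative to GMTS can be read off Table 1 as a simple ratio. The sketch below copies the tabulated work units and computes that ratio; the names are illustrative, and no values beyond those in the table are introduced.

```python
# Work units from Table 1, keyed by temporal order and number of physical time steps.
WORK_UNITS = {
    "1st": {1: 54.6, 5: 210.9, 10: 193.8},
    "2nd": {1: 51.2, 5: 182.4, 10: 171.0},
    "3rd": {1: 42.1, 5: 159.5, 10: 171.0},
    "4th": {1: 15.8, 5: 165.3, 10: 193.8},
    "5th": {1: 15.8, 5: 153.8, 10: 171.0},
}
GMTS_WORK_UNITS = 2000.0


def speedup(order, steps):
    """Ratio of GMTS work units to the work units of the present method."""
    return GMTS_WORK_UNITS / WORK_UNITS[order][steps]
```

For the five- and ten-step calculations the ratio lies between roughly 9.5 and 13, which is consistent with the order-of-magnitude reduction reported.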


The second test case used to demonstrate the present method is the unsteady flow over an impulsively 
started two-dimensional circular cylinder (with a Reynolds number of 1200 and a Mach number of 0.3). 
Detailed experimental and numerical investigations of the flow behind a cylinder have been performed 
previously by other authors. (See, for example, reference [15].) The initial flow is symmetric with zero 
lift as the wake behind the cylinder begins to grow. As the wake continues to grow, it becomes unstable 
and begins to shed from alternate sides of the cylinder. 

An examination is made of the first part of the solution where the symmetric wake begins to grow. 
In these calculations, 4600 GMTS steps were used to reach t* = 2.4. Detailed comparisons of the first- 
through fourth-order solutions with both the present method and GMTS for t* = 0 to t* = 2.4 are shown 
in figure 6. The calculations were performed with the present scheme with 10, 20, and 40 steps to reach 
t* = 2.4. As with Stokes first problem, the first-order discretization does not give satisfactory results 
(figure 6a). Second-order differencing (figure 6b) shows much better agreement with the GMTS solution. 
Third-order differencing shows good agreement for the two smaller time-step sizes (figure 6c). Fourth- 
order differencing gives bad overshoots for the 10-step calculation and instability for the 20- and 40-step 
calculations (figure 6d). 

The present scheme was then used to calculate the flow around the cylinder out to times where the 
vortex shedding occurred. Time histories of the lift coefficient $C_l$ and the drag coefficient based on 
integrated pressures $C_{dp}$ are shown in figure 7. From experimental data and the results of the GMTS 
calculations shown in reference [15], the period of the oscillation of $C_{dp}$ is known to be approximately 
4 in terms of the nondimensional time t*. To give 40 time steps per period, a time step of $\Delta t^* = 0.1$ 
was used. This time-step size is roughly equal to the time step used in the 20-step calculations shown in 
figure 6. The first-order discretization predicted a Strouhal number of 0.21. The second- and third-order 
discretizations predicted a Strouhal number of 0.24 compared with the experimentally obtained value of 
0.21. The fourth-order physical time discretization calculation diverged. 


CONCLUSIONS 


A method to calculate time-accurate solutions to the unsteady Navier-Stokes equations 
has been presented. Multigrid acceleration has been successfully employed to accelerate the calculations 
of the iterative-implicit method. Run times that are one order of magnitude smaller than the run times 
required for global minimum time stepping have been demonstrated. 

Figure 2. Amplification along the locus of the spatial operator. 

Figure 3. Amplification of implicit schemes with A = 1.0. 

Figure 4. Amplification of {51, 51} case for a range of A. 

Figure 5. Velocity profiles for impulsively started flat plate. (a) Analytic solution. (b) First order. 
(c) Second order. (d) Third order. (e) Fourth order. (f) Fifth order. 

Figure 6. Initial pressure drag-coefficient history for impulsively started cylinder. (c) Third order. 
(d) Fourth order. 

SUMMARY 


MGGHAT (MultiGrid Galerkin Hierarchical Adaptive Triangles) is a program for the solu- 
tion of linear second order elliptic partial differential equations in two dimensional polygonal 
domains. This program is now available for public use. It is a finite element method with linear, 
quadratic or cubic elements over triangles. The adaptive refinement via newest vertex bisection 
and the multigrid iteration are both based on a hierarchical basis formulation. Visualization is 
available at run time through an X Window display, and a posteriori through output files that can 
be used as GNUPLOT input. In this paper, we describe the methods used by MGGHAT, define 
the problem domain for which it is appropriate, illustrate use of the program, show numerical and 
graphical examples, and explain how to obtain the software.

INTRODUCTION 


MGGHAT (MultiGrid Galerkin Hierarchical Adaptive Triangles) is a program for the solu- 
tion of linear second order elliptic partial differential equations in two dimensional polygonal 
domains. It solves equations of the form: 

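\[ -(p u_x)_x - (q u_y)_y + r u = f \quad \text{in } \Omega \]

\[ u = g \quad \text{on } \partial\Omega_1 \]

\[ \frac{\partial u}{\partial n} + c u = g \quad \text{on } \partial\Omega_2 \]

where $\Omega$ is a polygonal domain in $\mathbb{R}^2$ and p, q, r, f, c, and g are functions of x and y, and n is the
unit normal direction.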

MGGHAT uses a finite element method with linear, quadratic or cubic elements over triangles.
The adaptive refinement via newest vertex bisection and the multigrid iteration are both
based on a hierarchical basis formulation. Visualization is available at run time through an X
Window display, and for post-run analysis through output files that can be used as GNUPLOT
input. The program is now available in the public domain through mgnet and netlib.


NUMERICAL METHOD 


The numerical method used by MGGHAT is a finite element method with adaptive 
refinement of the grid and a multigrid solution of the equations. In this section we briefly 
describe the method used. More details of the method can be found in [1], and a full description 
and analysis in [2], which is contained in the MGGHAT software package. 

Discretization


MGGHAT solves elliptic differential equations using the standard Galerkin finite element
method. A triangular mesh is used over the 2D domain. The basis functions are $C^0$ continuous
piecewise polynomials of any specified degree. Currently, the program only handles linear, quadratic
and cubic polynomials, but can be modified to handle higher order polynomials by defining
a quadrature rule of the appropriate accuracy.


Adaptive refinement 


The program provides automatic adaptive refinement of the grid to ensure the highest accu- 
racy for the number of nodes used. The refinement of triangles is performed using the newest 
vertex bisection method. This method divides pairs of triangles through the midpoint of their com- 
mon edge, which is equivalent to enhancing the approximation space by one hierarchical basis 
function (in the linear case). The error estimate, used to determine which triangles should be 
divided, is based on an estimate of the coefficient of the new hierarchical basis function. 
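
The bisection step itself can be illustrated with a minimal Python sketch. It is not MGGHAT code: the triangle representation (two base vertices followed by the newest vertex, here called the peak) and the function names are assumptions made for illustration, and the sketch bisects single triangles without the pairing of neighbours across the common edge that the program performs.

```python
def area(tri):
    """Signed area of a triangle given as three (x, y) vertices."""
    (x1, y1), (x2, y2), (x3, y3) = tri
    return 0.5 * ((x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1))


def bisect(tri):
    """Split (a, b, peak) through the midpoint of the edge a-b opposite the peak.

    The new midpoint becomes the peak (newest vertex) of both children,
    so the next bisection of a child cuts the edge opposite that midpoint.
    """
    a, b, peak = tri
    mid = ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)
    return (peak, a, mid), (b, peak, mid)


def refine_uniformly(tris, levels):
    """Bisect every triangle 'levels' times (no error indicator is used here)."""
    for _ in range(levels):
        tris = [child for t in tris for child in bisect(t)]
    return tris


coarse = [((1.0, 0.0), (0.0, 1.0), (0.0, 0.0))]   # right triangle, peak at the origin
fine = refine_uniformly(coarse, 4)
print(len(fine), sum(area(t) for t in fine))
```

Each bisection halves the area and keeps the orientation of the parent, so the children always tile the original triangle exactly.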


Solution 


The equations are solved using a hierarchical basis multigrid method. The relaxation phase 
consists of red-black Gauss-Seidel iterations on the nodal basis equations. The number of itera- 
tions can be user specified, but usually a red phase before coarse grid correction and a red and 
black phase after coarse grid correction suffices for optimal convergence rates. The grid transfers 
are a natural consequence of the transformation between the nodal and hierarchical bases, and can 
be shown to lead to a method equivalent to the "Galerkin" multigrid method in simple cases. 
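
The red and black half-sweeps can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses the 5-point Laplacian on a uniform square grid rather than MGGHAT's nodal basis equations on triangles, and all names are hypothetical.

```python
def half_sweep(u, f, h, color):
    """Relax every interior point with (i + j) % 2 == color (0 = red, 1 = black)."""
    n = len(u)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            if (i + j) % 2 == color:
                u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1]
                                  + u[i][j + 1] + h * h * f[i][j])


def residual(u, f, h, i, j):
    """Residual of the 5-point discretization of -Laplace(u) = f at node (i, j)."""
    return f[i][j] - (4.0 * u[i][j] - u[i - 1][j] - u[i + 1][j]
                      - u[i][j - 1] - u[i][j + 1]) / (h * h)


def max_residual(u, f, h, color=None):
    """Largest residual over all interior nodes, or over one color only."""
    n = len(u)
    return max(abs(residual(u, f, h, i, j))
               for i in range(1, n - 1) for j in range(1, n - 1)
               if color is None or (i + j) % 2 == color)


n = 9                      # grid points per side, zero Dirichlet boundary values
h = 1.0 / (n - 1)
f = [[1.0] * n for _ in range(n)]
u = [[0.0] * n for _ in range(n)]
r0 = max_residual(u, f, h)
for _ in range(100):
    half_sweep(u, f, h, 0)   # red phase
    half_sweep(u, f, h, 1)   # black phase
print(r0, max_residual(u, f, h))
```

Because every neighbour of a red node is black (and vice versa), the nodes of one color can be relaxed in any order, and the residual at the nodes just relaxed vanishes up to rounding.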


MGGHAT SOFTWARE 


MGGHAT is written in standard FORTRAN 77, and is callable as a subroutine. An exam- 
ple main program for MGGHAT is shown in Figure 1. The program has been tested on 3 com- 
puter configurations: 1) a Pyramid computer using the f77 compiler under a dual port of UNIX 
SysV Release 2.0, 2) a Sun workstation using the f77 compiler under SunOS 4.1.1, and 3) an i486 
based PC using the f2c translator and gcc compiler under the Linux operating system. The pro- 
gram is easily installed with the makefile provided in the distribution, and requires only a FOR- 
TRAN compiler for the basic functionality. A C compiler is required for the UNIX dependent 
supplied timer routine (which can be replaced by the user). A C compiler and X Window 
libraries are required for the (optional) X Window graphics capability. 


Problem Definition 


The differential equation, boundary conditions and domain are defined by user supplied
subroutines. Figure 2 contains examples of these routines. The subroutine pde defines the equation
by providing the value of the functions p, q, r and f at any point (x,y). Subroutine bcond
contains the boundary conditions. The boundary is partitioned into a set of pieces in the initial
triangulation. The piece containing the point (x,y) is passed to bcond through ipiece. bcond
returns the functions c and g and sets itype to flag the boundary condition as Dirichlet or Mixed
(including Neumann if c=0). If the true solution is known, the user can supply functions true,
truex, and truey to obtain error calculations. The initial triangulation (coarse grid) is defined by
the user in subroutine inittr (not shown).


Parameters 


The user has control over the program through several parameters.

mxvert, mxtri, mxlev, mxnode and mxtime: maximum values for the number of vertices, triangles,
refinement levels, nodes and execution time can be used as termination criteria.

tol: an error tolerance that can be used as a termination criterion.

outlev: controls the amount of printed output. Can be 0 for no output, 1 for summary at the end
of execution, 2 for summary after each program phase, 3 for detailed information, and 4 and 5 for
debugging level output. An extraction from a level 2 output is illustrated in Figure 3.

iorder: specifies the order (degree+1) of the piecewise polynomial basis functions.

nu1 and nu2: number of (half) red-black Gauss-Seidel iterations to perform before and after coarse
grid correction, respectively.

ncyc: number of multigrid cycles to perform in each solution phase.

unifrm: a logical variable to indicate a uniform refinement should be used rather than adaptive
refinement.


Graphics 


Graphics support is provided in two forms: run time graphics on an X Window display, 
and output files suitable for input to GNUPLOT. The run time graphics use a small set of rou- 
tines which call on the X Window graphics library. The user can expand this to support other 
graphics devices by writing equivalent routines (draw a point, draw a line, print some text, etc.) 
for the desired device. There are nine forms of run time graphics: 

1) contour plot of computed solution with triangulation 

2) contour plot of true solution with triangulation 

3) contour plot of error with triangulation 

4) color plot of computed solution 

5) color plot of true solution 

6) color plot of error 

7) triangulation 

8) graph of number of nodes vs. relative error in energy norm (or error estimate)

9) contour plot of both computed solution and true solution 

Either one or two of these forms can be displayed during one run. When two are 
displayed, additional numerical information is printed on the display, including grid size informa- 
tion, norms of the error and error estimate, and execution time. Figure 4 contains an example of 
the run time graphic displays. 

The user can select to save information in data files for later processing by GNUPLOT. 
These files contain the triangulation, computed and true solutions, and convergence data. Figures 
5 and 6 contain plots generated by GNUPLOT. 

OBTAINING MGGHAT


MGGHAT is now available in the public domain. It can be obtained either from mgnet or
netlib.


mgnet 


To obtain MGGHAT from mgnet (the multigrid network) ftp to casper.cs.yale.edu. Login
as anonymous and use your email address as the password. Change to the mgghat directory by
typing cd mgnet/mgghat. Then type ls to see what files are available, and get filename for each
file you desire. To learn more about mgnet, also get the file mgnet.README from the mgnet
directory.


netlib 


MGGHAT can be obtained from netlib using ftp, the mail server, or xnetlib. For ftp 
retrieval, ftp to research.att.com and follow the anonymous login procedure described above. 
Look for MGGHAT in the directory netlib/pdes/mgghat. To obtain MGGHAT via email, send a
message to netlib@ornl.gov, netlib@research.att.com, or one of the other netlib servers with the
message send index from pdes/mgghat. To learn how to obtain materials from netlib through an
X Window interface, send the message send index from xnetlib to one of the netlib mail servers. 
For more information on netlib, send the message send index to one of the netlib mail servers. 

      program main
      include 'commons'       ! all parameters are passed through common
c
c set maximum allowed values based on dimensions
c
      mxvert = ndvert
      mxtri = ndtri
      mxlev = ndlev
      mxnode = ndnode
c
c set program parameters
c
      mxtime = 12.*60.*60.    ! maximum execution time in seconds
      ioutpt = 6              ! unit for printed output
      outlev = 2              ! amount (level) of printed output
      iorder = 2              ! polynomial order (linear in this case)
      nu1 = 1                 ! number of relaxation iterations before
      nu2 = 2                 ! and after coarse grid correction
      ncyc = 1                ! number of multigrid cycles
      tol = 0.001             ! error tolerance for termination
      mgfreq = 2.             ! how often to do multigrid cycle
      unifrm = .false.        ! flag for uniform/adaptive grid
      igrf1 = 0               ! run time graphics selections (no
      igrf2 = 0               ! graphics in this example)
      grflst = 0.             ! a value for which a contour line is drawn
      grfsiz = .1             ! and the spacing between contours
      grffinn = 0.            ! bounds for determining the color
      grffinx = 2.            ! map for color contour plots
      gptri = 0               ! set to 1 to save triangulation for gnuplot
      gpsol = 0               ! set positive to save solution for gnuplot
      gpconv = 0              ! set to 1 to save convergence info for gnuplot

      call mgghat             ! invoke mgghat

      stop
      end


Figure 1. Sample main program.

      subroutine pde(x,y,p,q,r,f)
      real x,y,p,q,r,f
c
c return the values of the pde coefficients at (x,y)
c
c    -( p(x,y) * u  )  -( q(x,y) * u  )  + r(x,y) * u = f(x,y)
c                 x  x              y  y
c
      p = 1.
      q = 1.
      r = 0.
      f = -20.*(x**3 + y**3)
      return
      end
c
      subroutine bcond(x,y,ipiece,c,g,itype)
      real x,y,c,g
      integer ipiece, itype
c
c returns boundary condition coefficients at (x,y)
c
c    u  + c(x,y)*u = g(x,y)    or    u = g(x,y)
c     n
c
c In this example, the b.c. is Dirichlet on piece 1, and 0 Neumann on piece 2
c
      if (ipiece.eq.1) then
         itype = 1
         c = 0.
         g = true(x,y)
      else
         itype = 2
         c = 0.
         g = 0.
      endif
      return
      end
c
      real function true(x,y)   ! true solution of the pde
      real x,y
      true = x**5 + y**5
      return
      end

Figure 2. Examples of subroutines to define the problem.
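
As a consistency check on the example problem, the following Python sketch (not part of MGGHAT; the function names and the finite-difference step are illustrative assumptions) confirms numerically that the true solution x**5 + y**5 satisfies the equation with p = q = 1, r = 0 and f = -20(x**3 + y**3), reading the operator with the minus signs shown in the comment of subroutine pde.

```python
def pde(x, y):
    """Coefficients p, q, r and right-hand side f, mirroring subroutine pde."""
    return 1.0, 1.0, 0.0, -20.0 * (x**3 + y**3)


def true_solution(x, y):
    """Mirror of the Fortran function true."""
    return x**5 + y**5


def apply_operator(u, x, y, h=1e-3):
    """Central-difference value of -(p u_x)_x - (q u_y)_y + r u at (x, y).

    Valid in this simple form only because p and q are constant here.
    """
    p, q, r, _ = pde(x, y)
    uxx = (u(x + h, y) - 2.0 * u(x, y) + u(x - h, y)) / (h * h)
    uyy = (u(x, y + h) - 2.0 * u(x, y) + u(x, y - h)) / (h * h)
    return -p * uxx - q * uyy + r * u(x, y)


def max_defect(points):
    """Largest mismatch between the operator applied to the true solution and f."""
    return max(abs(apply_operator(true_solution, x, y) - pde(x, y)[3])
               for x, y in points)


points = [(0.1 * i, 0.1 * j) for i in range(11) for j in range(11)]
print(max_defect(points))
```

The printed defect is at the level of the finite-difference truncation error, which supports the sign convention used in the listing.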

MULTIGRID GALERKIN HIERARCHICAL ADAPTIVE TRIANGLES (MGGHAT)
Version 0.9 (March 1993) 


input parameters: 
output level 2 

polynomial order 2 

number of cycles 1 

relaxes before cgc 1 
relaxes after cgc 2 

multigrid frequency 2.00 
error tolerance 0.0E+00 
refinement adaptive 


begin initialization 

initializations complete 

time for initialization .00 

begin refinement 
refinement complete 


number of vertices 18 

number of nodes 18 

number of triangles 22 

number of levels 3 


time for refinement (this grid) .02 

time for refinement (all grids) .02 

begin solution 
solution complete 

norms of error: 

max norm at vertices 1.20466471E-01

max norm at nodes 1.20466471E-01
max norm at quad pts 2.12660193E-01
continuous energy norm 3.30431342E-01
relative energy norm 1.49259701E-01

time for solution (this grid) .01 

time for solution (all grids) .01 

Figure 3. Sample level 2 output. 

begin error indicators

error indicators and estimates complete 

maximum error indicator 1.99157119E-01
error estimate 4.55372840E-01

effectivity index 1.37811637E+00

relative error estimate 1.96938977E-01
relative effect index 1.31943834E+00


time for error estimates (this grid) .00 

time for error estimates (all grids) .00 

time for this refinement/solution step .03 

total time so far .03 


final solution complete 

maximum error at vertices 7.39555359E-02 
maximum error at nodes 7.39555359E-02 
maximum error at quad pts 1.25789344E-01 
continuous energy norm 2.70688415E-01 
maximum error indicator 1.41879827E-01 
error estimate 4.30013269E-01 

effectivity index 1.58859134E+00 

relative energy norm 1.87541485E-01 
relative effect index 1.58822513E+00 

number of vertices 32 

number of nodes 32 

number of triangles 45 

number of levels 5 

time for initializations 
time for refinement 
time for solution 
time for error estimates 
total time 

termination due to achieving maximum nodes 
execution sucessful 

Figure 3. Sample level 2 output (continued). 

Figure 5. gnuplot plot of triangulation and solution.

SUMMARY


Multigrid methods are good candidates for the resolution of the systems arising in Numerical
Fluid Dynamics. However, the question is whether those algorithms, which are efficient for the
Poisson equation on structured meshes, will still apply well to the Euler and Navier-Stokes equations
on unstructured meshes. The study of elliptic problems leads us to define the conditions under which a Full
Multigrid (FMG) strategy has O(N) complexity. The aim of this paper is to build a comparison between the
elliptic theory and practical CFD problems.

First, as an introduction, we will recall some basic definitions and theorems applied to a model
problem. The goal of this section is to point out the different properties that we need to produce
an FMG algorithm with O(N) complexity. Then, we will show how we can apply this theory to the
fluid dynamics equations such as the Euler and Navier-Stokes equations. Finally, we present some results
which are 2nd-order accurate and some explanations about the behaviour of the FMG process.


INTRODUCTION 


One first important element is the mesh independent convergence speed. Hackbusch in [1], for
example, proposes a demonstration of this property. It is done in the special case of an elliptic
problem on structured nested meshes. We want to evaluate the properties that we must keep in order
to get the mesh independent convergence speed when we use unstructured non-embedded meshes.

The problem to be solved is the following:

\[ \begin{cases} -\Delta u = f & \text{on } \Omega, \text{ convex polygonal domain} \\ u|_{\partial\Omega} = 0 \end{cases} \qquad u \in H_0^1(\Omega) \text{ and } f \in L^2(\Omega) \qquad (1) \]

Work partly supported by DRET Groupe 6 under contract.
Supported by INRIA and "Region Provence-Alpes-Cote d'Azur" (France), and ICASE (USA).

The discretization is a usual linear P1-Galerkin finite element. Thus, we get a discrete space $\mathcal{H}_h$
whose dimension is equal to $N_h$ (number of nodes), and where the subscript h indicates the mesh
size. The resulting problem consists now in solving the linear system:

\[ A_h u_h = f_h \qquad (2) \]

We may evaluate the discretization error thanks to the Aubin-Nitsche theorem and the usual regularity:

\[ \|u - u_h\|_{L^2} \le C_2 h^2 \|u\|_{H^2} \le C_3 h^2 \|f\|_{L^2} \qquad (3) \]

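The linear system (2) may incur a lot of CPU time because of its size (large number of nodes). The
idea is then to use a second finite element subspace $\mathcal{H}_H$ whose dimension $N_H$ is less than the previous
one (usually H = 2h). We have then the following relationship between both spaces (also called
grids):

\[ \begin{array}{ccc} \mathbb{R}^{N_h} & \xrightarrow{\;A_h\;} & \mathbb{R}^{N_h} \\ {\scriptstyle P}\uparrow & & \downarrow{\scriptstyle R} \\ \mathbb{R}^{N_H} & \xrightarrow{\;A_H\;} & \mathbb{R}^{N_H} \end{array} \]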

where P and R are linear interpolations (transfer operators). The iterative process can be written
as:

\[ u_h^{n+1} = M_h u_h^{n} + N_h f_h \quad \text{where: } M_h = S_h^{\nu_2} (I - P A_H^{-1} R A_h) S_h^{\nu_1}, \quad S_h = I - \omega D_h A_h \]

(S defines the basic iterative smoother, $\nu_1$ and $\nu_2$ the number of pre- and post-relaxations). Such
a process converges if $\|M_h\| < 1$. A very important property of this kind of method is that the
convergence is independent of the mesh size. In order to simplify the notations (and the study) we
rewrite $M_h$ as the following ideal 2-grid operator [2]:

\[ M_h = (A_h^{-1} - P A_H^{-1} R)(A_h S_h^{\nu}) \qquad (4) \]

The norms of both factors of the right hand side of equation (4) will determine the norm of $M_h$:


The smoothing property:

\[ \|A_h S_h^{\nu}\| \le \frac{1}{h^2}\,\eta(\nu), \qquad \lim_{\nu \to \infty} \eta(\nu) = 0 \qquad (5) \]

depends a lot on the basic smoothing process, and we will not give any details.
The approximation property is:

\[ \|A_h^{-1} - P A_H^{-1} R\| = O(h^2) \qquad (6) \]

Let us focus on (6): it takes into account the transfer operators, and overall, represents the 
difference that exists between the solution on the fine grid and the solution on the coarse grid. 

An MG scheme that exhibits these properties will result in a convergence speed that is independent
of the mesh size:

\[ \forall \rho \;\; \exists\, \nu(\rho) \text{ such that } \|M_h v_h - u_h\| \le \rho\, \|v_h - u_h\| \]

We may notice that demonstrating the approximation property leads to the evaluation of the following
quantity:

\[ \|p_h P A_H^{-1} R\, r_h - A^{-1}\| \]
For nested meshes, thanks to (3), one can easily derive this from the following equality:

\[ p_h P = p_H \]

On the other hand, for unnested meshes, $p_h P$ is not equal to $p_H$ and this evaluation is more difficult.
Actually, it is the same as evaluating the difference between two interpolated solutions. Zhang [3],
thanks to Bank-Dupont's theory, proposes such an evaluation.

Remark: The multigrid iterative V-cycle algorithm can easily be deduced from the previous 2-grid
algorithm recursively. It maintains the convergence speed independently of the mesh size and has an
O(N log N) complexity.
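
Mesh-independent convergence of the recursive V-cycle can be observed on a small model problem. The Python sketch below is an illustration only, under assumptions that differ from the setting of this paper: a 1D Poisson problem on nested uniform grids, a damped Jacobi smoother with weight 2/3, full-weighting restriction and linear interpolation. All function names are hypothetical.

```python
import random


def residual(u, f, h):
    """Residual f - A_h u for the 3-point discretization of -u'' = f."""
    n = len(u) - 1
    r = [0.0] * (n + 1)
    for i in range(1, n):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r


def smooth(u, f, h, sweeps, omega=2.0 / 3.0):
    """Damped Jacobi sweeps (the diagonal of A_h is 2/h^2)."""
    for _ in range(sweeps):
        r = residual(u, f, h)
        for i in range(1, len(u) - 1):
            u[i] += omega * 0.5 * h * h * r[i]


def restrict(r):
    """Full-weighting restriction to the grid with half as many intervals."""
    nc = (len(r) - 1) // 2
    return ([0.0]
            + [0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1]
               for i in range(1, nc)]
            + [0.0])


def prolong(ec):
    """Linear interpolation to the grid with twice as many intervals."""
    nc = len(ec) - 1
    e = [0.0] * (2 * nc + 1)
    for i in range(nc):
        e[2 * i] = ec[i]
        e[2 * i + 1] = 0.5 * (ec[i] + ec[i + 1])
    e[2 * nc] = ec[nc]
    return e


def v_cycle(u, f, h, nu1=1, nu2=1):
    """One V(nu1, nu2)-cycle, obtained by applying the 2-grid scheme recursively."""
    n = len(u) - 1
    if n == 2:                       # coarsest grid: one unknown, solved exactly
        u[1] = 0.5 * h * h * f[1]
        return u
    smooth(u, f, h, nu1)
    rc = restrict(residual(u, f, h))
    ec = v_cycle([0.0] * len(rc), rc, 2.0 * h, nu1, nu2)
    for i, e in enumerate(prolong(ec)):
        u[i] += e
    smooth(u, f, h, nu2)
    return u


def convergence_factor(n, cycles=6):
    """Residual reduction of the last cycle, starting from a random error."""
    rng = random.Random(0)
    h = 1.0 / n
    f = [0.0] * (n + 1)
    u = [0.0] + [rng.uniform(-1.0, 1.0) for _ in range(n - 1)] + [0.0]
    norms = []
    for _ in range(cycles):
        v_cycle(u, f, h)
        norms.append(max(abs(r) for r in residual(u, f, h)))
    return norms[-1] / norms[-2]


factors = {n: convergence_factor(n) for n in (32, 64, 128, 256)}
for n, fac in factors.items():
    print(n, round(fac, 3))
```

The printed reduction factors can be compared across the four grids; for this model problem they stay well below 1 as the mesh is refined, in contrast to a single-grid Jacobi iteration whose factor approaches 1.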

In order to apply the previous result of convergence, we propose to use a Full Multigrid (FMG)
strategy (proposed by Brandt in [4]). A well known result is the one given by Hackbusch in [1], which
is given below:

Theorem: We note: $\kappa$ the consistency order, $C_2 = \max$ the ratio of accuracy
between two solutions, S the reduction factor of the MG process, i the number of MG iterations
applied to reach the solution $u_k^i$. If $u_k$ is the solution of the discrete problem, we have:

$\|u_k^i - u_k\| \le C_1 C_2 h_k^{\kappa}$ with $C_1 =$

Assuming that $C_2 = 2^{\kappa}$, we deduce the exact number of cycles in each FMG phase to solve the
1st-order problem ($S^i < 1/4$) and the 2nd-order problem ($S^i < 1/8$). The number of cycles i in
each phase is constant, which leads to an algorithm that has O(N) complexity.
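
The cycle count per phase follows directly from the reduction factor. The short Python sketch below assumes that the two thresholds quoted above (1/4 for the 1st-order problem, 1/8 for the 2nd-order problem) generalize to $S^i < 2^{-(\kappa+1)}$; that generalization and the function name are assumptions made for illustration.

```python
def cycles_per_phase(S, kappa):
    """Smallest i with S**i < 2**-(kappa + 1), for a reduction factor 0 < S < 1."""
    if not 0.0 < S < 1.0:
        raise ValueError("the reduction factor S must lie strictly between 0 and 1")
    target = 2.0 ** -(kappa + 1)
    i, reduction = 0, 1.0
    while reduction >= target:
        reduction *= S
        i += 1
    return i


for S in (0.1, 0.3, 0.5):
    print(S, cycles_per_phase(S, 1), cycles_per_phase(S, 2))
```

Since S does not depend on the mesh size, neither does i, which is what keeps the total FMG work proportional to N.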

Once again, the relative interpolation error [1] conditions the quality of the initialization in each 
phase. Thus, in order to stay close to the ideal scheme (where the different subspaces are nested), we 
propose to build meshes where: 

• The mesh size ratio is close to 2, 

• The triangles aspect ratio is locally comparable in the whole domain. 

We have thus identified the different necessary ingredients to build an algorithm having O(N) com- 
plexity: 


1. A sequence of grids, 

2. A basic smoother (ex: Jacobi,Gauss-Seidel), 

3. Intergrid transfer operators (ex: linear interpolations), 

4. MG algorithm (ex: V-cycle, W-cycle), 

5. FMG strategy. 

We may now apply it to more complex fluid dynamics problems such as the resolution of the 
Navier-Stokes/Euler equations. 

FLUID DYNAMICS APPLICATIONS


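We recall first the formulation of the steady Navier-Stokes equations:

\[ \frac{\partial F(W)}{\partial x} + \frac{\partial G(W)}{\partial y} = \frac{1}{Re}\left(\frac{\partial R(W)}{\partial x} + \frac{\partial S(W)}{\partial y}\right), \qquad W = (\rho, \rho u, \rho v, E)^T, \]

\[ F(W) = \begin{pmatrix} \rho u \\ \rho u^2 + p \\ \rho u v \\ (E + p)u \end{pmatrix}, \qquad G(W) = \begin{pmatrix} \rho v \\ \rho u v \\ \rho v^2 + p \\ (E + p)v \end{pmatrix}, \]

\[ R(W) = \begin{pmatrix} 0 \\ \tau_{xx} \\ \tau_{xy} \\ u\tau_{xx} + v\tau_{xy} + \frac{\gamma k}{Pr}\frac{\partial e}{\partial x} \end{pmatrix}, \qquad S(W) = \begin{pmatrix} 0 \\ \tau_{xy} \\ \tau_{yy} \\ u\tau_{xy} + v\tau_{yy} + \frac{\gamma k}{Pr}\frac{\partial e}{\partial y} \end{pmatrix}, \]

\[ p = (\gamma - 1)\left(E - \tfrac{1}{2}\rho(u^2 + v^2)\right), \qquad e = C_v T = \frac{E}{\rho} - \tfrac{1}{2}(u^2 + v^2) \qquad (7) \]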


where $\gamma = 1.4$ is the ratio of specific heats, T is the temperature, $\mu$ and k are the normalized viscosity
and thermal conductivity coefficients. The components of the Cauchy stress tensor $\tau_{xx}$, $\tau_{xy}$ and $\tau_{yy}$
are given by:



$Re = \rho_0 U_0 L_0/\mu_0$ is the Reynolds number and $Pr = \mu_0 C_p/k_0$ is the Prandtl number, where $\rho_0$, $U_0$, $L_0$
and $\mu_0$ denote respectively the characteristic density, velocity, length and diffusivity of the considered
flow. It is easily seen that, if the right-hand side is equal to zero, then we recover the Euler equations.
The inlet conditions are defined by the farfield flow. For Euler flows, we impose the slip condition
on the wall ($V \cdot n = 0$), and for Navier-Stokes flows, on the wall, the no-slip condition (V = 0) and
the isothermal condition ($T = T_b$). The discretization is given by a mixed FEM/FVM formulation
[5], where the mesh is a finite-element type (triangles), on which we construct control-cells (FVM) in
order to solve the variational formulation of the equations, such as, for the Euler flows:


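\[ \sum_{j \in K(i)} \int_{\partial C_{ij}} \vec{\nu}_i \cdot \mathcal{F}(W)\, d\sigma + \int_{\partial C_i \cap \Gamma} \vec{\nu}_i \cdot \mathcal{F}(W)\, d\sigma = -\frac{1}{Re} \sum_{T,\, i \in T} \iint_T \left( R\, \frac{\partial \varphi_i}{\partial x} + S\, \frac{\partial \varphi_i}{\partial y} \right) dx\, dy \qquad (8) \]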


The computation of the fluxes appearing in (8) between two cells is managed by Roe's numerical
flux vector splitting in the domain, and by the Steger-Warming numerical flux vector splitting for the
farfield boundaries. The 2nd-order accurate scheme is obtained by the use of the MUSCL method
developed by van Leer [6]. We solve the discrete equations with non-linear relaxation algorithms [6],
namely here the multistage Jacobi algorithm [2, 7]:

\[ W_i^{(0)} = W_i^{n} \]

For ks = 1 to nstage

\[ W_i^{(ks)} = W_i^{(0)} - \omega\, \alpha_{ks}\, [D_i]^{-1} \sum_j \Phi\left(W_i^{(ks-1)}, W_j^{(ks-1)}, \eta_{ij}\right), \qquad D_i = \frac{\partial \Phi}{\partial W_i} \]

\[ W_i^{n+1} = W_i^{(nstage)} \qquad (9) \]
Let us now look at the meshes.

We start with an initial given fine mesh (Fig.1.a). Finer meshes are
obtained by triangle subdivision (Fig.1.b,.c). Then, we use a coarsening algorithm due to Guillard
[8] to build coarser meshes, from the initial one. This produces a sequence of node-embedded meshes
(Fig.2.a,.b,.c). We get 6 meshes for the NACA0012 profile, where the finest has 12284 nodes and the
coarsest 19 nodes. This method allows us to keep the mesh size ratio close to 2, and a comparable
local mesh aspect ratio.

Figure 1: NACA0012 fine meshes. a. Initial 800 nodes; b. 3114 nodes; c. Finest 12284 nodes.

The intergrid transfer operators [9] are linear interpolations, concerning the
variables and the corrections, and linear distributions, concerning the residuals. The MG algorithm
will be the W-cycle, because it is the natural extension of the ideal 2-grid scheme. Furthermore, there
exist several ways to obtain 2nd-order accurate solutions:

• Mavriplis [10] uses an FMG algorithm, where 1st-order accurate solutions are computed on the
coarse levels, and 2nd-order accurate on the finest. Some experiments with our upwind schemes
showed us that the convergence speed is not truly independent of the mesh size.

• Hemker-Koren [11] propose to get a 1st-order solution with an FMG strategy and then to
compute a certain number of DeCV-cycles: They use the Defect Correction (DeC) algorithm
[12], in order to solve the following 2nd-order accurate problem: $\mathcal{F}_2(W) = S$. It is written:

\[ \mathcal{F}_1(W^{1}) = S, \qquad \mathcal{F}_1(W^{N+1}) = \mathcal{F}_1(W^{N}) - \mathcal{F}_2(W^{N}) + S, \quad N = 1, 2, \ldots \qquad (10) \]

Actually, we define the DeCV-cycling method where the 1st-order problem in (10) is approximately
solved with one V-cycle. However, we do not know how many DeCV-cycles are to be
performed and we lose the O(N) complexity.

Figure 2: NACA0012 coarse meshes.
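
The defect correction iteration (10) can be tried on a linear model problem. The Python sketch below is an illustration under assumptions that are not taken from this paper: a 1D convection-diffusion equation with zero boundary values, a first-order upwind operator in the role of $\mathcal{F}_1$, a second-order central operator in the role of $\mathcal{F}_2$, and exact solves of the first-order problem in place of the V-cycle. All names are hypothetical.

```python
def solve_tridiag(lo, di, up, rhs):
    """Thomas algorithm for a constant-coefficient tridiagonal system."""
    n = len(rhs)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = up / di
    d[0] = rhs[0] / di
    for i in range(1, n):
        denom = di - lo * c[i - 1]
        c[i] = up / denom
        d[i] = (rhs[i] - lo * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def apply_tridiag(lo, di, up, w):
    """Matrix-vector product with the same constant-coefficient tridiagonal matrix."""
    n = len(w)
    return [(lo * w[i - 1] if i > 0 else 0.0) + di * w[i]
            + (up * w[i + 1] if i < n - 1 else 0.0) for i in range(n)]


def operators(n, eps):
    """Stencils (lower, diagonal, upper) of -eps*u'' + u' on n interior points."""
    h = 1.0 / (n + 1)
    diff = eps / (h * h)
    f1 = (-diff - 1.0 / h, 2.0 * diff + 1.0 / h, -diff)        # upwind, 1st order
    f2 = (-diff - 0.5 / h, 2.0 * diff, -diff + 0.5 / h)        # central, 2nd order
    return f1, f2


def defect_correction(n, eps, iterations):
    f1, f2 = operators(n, eps)
    s = [1.0] * n
    w = solve_tridiag(*f1, s)                       # F1(W^1) = S
    history = []
    for _ in range(iterations):
        defect = [si - yi for si, yi in zip(s, apply_tridiag(*f2, w))]
        history.append(max(abs(d) for d in defect))
        # Linear case of (10): W^{N+1} = W^N + F1^{-1}(S - F2(W^N))
        correction = solve_tridiag(*f1, defect)
        w = [wi + ci for wi, ci in zip(w, correction)]
    return w, history


w, history = defect_correction(31, 0.1, 40)
direct = solve_tridiag(*operators(31, 0.1)[1], [1.0] * 31)
print(history[0], history[-1], max(abs(a - b) for a, b in zip(w, direct)))
```

Only first-order systems are ever solved, yet the iterates approach the solution of the second-order problem; how many iterations are needed depends on the problem, which is the difficulty pointed out above.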

We propose here two different methods in order to obtain 2nd-order accurate solutions with an O(N)
complexity algorithm [1]:

• FMDeCV is an FMG strategy where we use DeCV-cycles in each phase, with two Jacobi sweeps
per level (FMDeCV-2RK1).

• FMG2 is an FMG strategy where we use on each level of the different phases the good damping
properties of the multistage schemes (see [13]) for smoothing directly the second order accurate
problem.

A result of convergence of the DeCV method is given in [14] and assures that DeCV-cycling has a
convergence speed independent of the mesh size:

\[ \|DeCV\, u_2^{\alpha} - u_2\| \le S_2\, \|u_2^{\alpha} - u_2\|, \qquad S_2 = S_1 + S_1 S_{DeC} + S_{DeC} < 1 \qquad (11) \]

where $DeCV\, u_2^{\alpha}$ is the $\alpha$-th iterate of the DeCV-cycling and $u_2$ is the solution of the 2nd-order
accurate problem.


Remark: Desideri-Hemker in [12] show that the convergence speed $S_{DeC}$ of the DeC process is at
least equal to 1/2. From (11), we should use MG 1st-order accurate algorithms whose convergence
speed $S_1$ (independent of the mesh size) is less than 1/3. Actually, the experiments showed us that
$S_1 > 1/3$ does not induce difficulty.
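
The threshold 1/3 in the remark can be recovered in one line, assuming that the first term of $S_2$ in (11) is $S_1$ (the reading that is consistent with the stated threshold) and taking $S_{DeC} = 1/2$:

```latex
S_2 = S_1 + S_1 S_{DeC} + S_{DeC}
    = S_1 + \tfrac{1}{2} S_1 + \tfrac{1}{2}
    = \tfrac{3}{2} S_1 + \tfrac{1}{2} < 1
\quad\Longleftrightarrow\quad
S_1 < \tfrac{1}{3}.
```

For example, $S_1 = 0.2$ would give $S_2 = 0.8$ under this assumption.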


SECOND ORDER ACCURATE RESULTS

Euler Flows around a NACA0012 profile


Figure 3: Euler FMG2 phases, isomach contours, $M_\infty = 0.9$, $\alpha = 0°$. Panel c: 19 nodes (Min./Max. Mach 0.83693, 0.9726).

The test-case depicted in Fig.3 to Fig.5 is defined by a farfield Mach number equal to 0.9, and
a zero angle of attack. The smoother is the (4 stage) RKJ one, whose coefficients ($\alpha_1 = 0.14$,
$\alpha_2 = 0.2939$, $\alpha_3 = 0.5252$, $\alpha_4 = 1$) are due to van Leer [15], and defines the FMG2 strategy. In Fig.3.a,
we present the convergence histories of the logarithm of the residual versus the number of cycles in
each phase. Convergence is considered reached when the residual decrease reaches $10^{-6}$. The
first history is a 1-grid convergence on the coarsest mesh (19 nodes), the last (4-grid scheme) on the
800 node mesh. At the end of the convergence of each phase we produce an initialization of the
next phase by interpolating the solution on the next finer mesh. We may notice that the different
convergence histories tend to be straight lines with the same slope: this allows us to say that
the convergence speed is independent of the mesh size and that we may use an FMG strategy. In
Fig.3.b the residual convergence histories of the last 5 phases are depicted (the first one is a 1-grid
convergence history with a residual decrease equal to $10^{-6}$). In order to neglect oscillatory non-linear
phenomena we choose to impose a residual decrease equal to 1/8 (and not a defined number of cycles):
the FMG convergence histories follow exactly the peaks of the (corresponding) phases on Fig.3.a, and
the solutions are not changed either. A solution on the finest grid (Fig.1.c) is reached after only 5
cycles (77 WU, 673 s on Convex C210 with non vectorized software). In Fig.3 and Fig.4 we show the
different solutions obtained at the end of each phase: they are non-symmetrical solutions (Fig.3), due
to the fact that the coarse meshes are non-symmetrical. However, this phenomenon vanishes when
the different meshes become symmetrical (Fig.4). Another important remark is that the solution
does not vary much between the finest mesh and the next coarser one: we may say that we get a
nearly converged solution on the 3114 node mesh. In order to verify the previous assumption, we
compare the FMG solution of Fig.5 with the solution obtained after a $10^{-6}$ residual decrease: the
Mach number extrema are approximately the same although the isomach lines are a little bit more
oscillatory for the FMG solution.

Figure 4: Euler FMG2 phases, isomach contours, $M_\infty = 0.9$, $\alpha = 0°$. Contours interval 0.50000E-01 in all panels. a. 800 nodes (Min./Max. Mach 0.22627, 1.4630); b. 3114 nodes (Min./Max. Mach 0.12322, 1.4793); c. 12284 nodes (Min./Max. Mach 0.45E-01, 1.5196).
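
The damping of the 4-stage smoother with the coefficients quoted above can be explored on a scalar model. The Python sketch below is an illustration only: it applies the multistage recursion to the linear test equation dw/dt = lam*w (with z = lam*dt) and uses the Fourier symbol of first-order upwind advection as an assumed model operator, which is not the Euler discretization of this paper. Function names are hypothetical.

```python
import cmath
import math

ALPHAS = (0.14, 0.2939, 0.5252, 1.0)   # stage coefficients quoted in the text


def multistage_step(w, z, alphas=ALPHAS):
    """One multistage step for dw/dt = lam*w, with z = lam*dt."""
    w0 = w
    for a in alphas:
        w = w0 + a * z * w        # every stage restarts from the initial value w0
    return w


def amplification(z, alphas=ALPHAS):
    """Amplification factor g(z) of one multistage step."""
    return multistage_step(1.0, z, alphas)


def upwind_symbol(cfl, theta):
    """Fourier symbol (times dt) of first-order upwind advection, an assumed model."""
    return -cfl * (1.0 - cmath.exp(-1j * theta))


def high_frequency_damping(cfl, samples=64):
    """Largest |g| over the high-frequency range theta in [pi/2, pi]."""
    thetas = [math.pi / 2 + k * (math.pi / 2) / samples for k in range(samples + 1)]
    return max(abs(amplification(upwind_symbol(cfl, t))) for t in thetas)


print(amplification(-1.0))
print(high_frequency_damping(1.0))
```

Because the last coefficient equals 1, g(z) agrees with 1 + z to first order, so the scheme is consistent; the remaining coefficients only shape the damping of the high frequencies.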

Figure 5: Euler FMG2 solutions, isomach contours, $M_\infty = 0.9$, $\alpha = 0°$.


Navier-Stokes Flows around a NACA0012 profile


The first test-case is defined by a farfield Mach number equal to 0.8, an angle of attack equal to 10
degrees and a Reynolds number equal to 73. We use here an FMDeCV-2RK1 strategy (justified by
the mesh independent convergence of Fig.6.a). The solution (Fig.6.c) is obtained on the finest mesh
after 4 cycles that represent 42 WU computation and a total CPU time equal to 537 s. In Fig.6.d we
illustrate the behavior of the pressure lift, CL (drag, CD) coefficient with a solid (dash) line, versus
the number of finest-grid iterations, up to a residual decrease on the finest grid of 10 . The points 

(cross) represent these coefficients during the FMG process, thus up to a 0.125 residual decrease. We
can notice that the value of each of them is almost obtained at the end of the FMG phase (the error
is equal to $64 \cdot 10^{-5}$ for the CL coefficient and to $16 \cdot 10^{-5}$ for the CD coefficient).

The second test-case, presented in Fig.7, is defined by a farfield Mach number equal to 2, an angle
of attack equal to 10 degrees, and a Reynolds number equal to 106. This time we use an FMG
strategy (more robust than FMDeCV-2RK1 that needs TVD limitation and implies that the residual
stalls from the value of $10^{-3}$); this produces a solution after 7 cycles (Fig.7.c, 109 WU, 1862 s), and
we can make the same remarks as in the previous test-cases. The next coarser mesh results in CL
and CD values within 1% of their final values which are obtained after 7 cycles on the finest mesh
with a related error respectively of $2 \cdot 10^{-6}$ and $2 \cdot 10^{-5}$ (Fig.7.d).

The last test-case shows us one limitation of this method. It is defined by a farfield Mach number
equal to 0.8, an angle of attack equal to 10 degrees and a Reynolds number equal to 500. Here again,
we use an FMDeCV-2RK1 strategy (Fig.8). We may note, once again, that the FMG convergence
histories (Fig.8.b) look like the corresponding peaks on Fig.8.a, thus we think that the solution does
not vary between a $10^{-1}$ residual decrease and a $10^{-6}$ one. However, on Fig.9.a we note that the
isomach lines are deformed up until the last solution (Fig.9.c) which presents two bumps. Actually,
since the solution differs a lot between the one obtained on the finest mesh and the other on the
next coarser mesh (Fig.9.c and .b), we may say that we did not obtain a grid-converged solution.
Furthermore, these two bumps may be explained by the fact that the meshes that we use (especially the coarse
ones) are not adapted for such a viscous flow and may not capture the boundary layer. Our assumption
is confirmed by the CL and CD behaviors (Fig.10.b), where we note an error for the CL coefficient
equal to $5 \cdot 10^{-2}$ and for the CD coefficient equal to $8 \cdot 10^{-3}$. Moreover, we get a solution (Fig.10.a)
after 105 cycles (1107 WU, 14112 s), for a residual decrease of $10^{-12}$, which tends to confirm that
the FMG solution is not a steady solution.

Figure 8: Navier-Stokes FMDeCV-2RK1 phases, isomach contours, $M_\infty = 0.8$, $\alpha = 10°$, Re = 500. c. 19 nodes; d. 67 nodes (contours interval 0.25000E-01); e. 223 nodes (contours interval 0.25000E-01; Min./Max. Mach 0., 1.0901).

Figure 9: Navier-Stokes FMDeCV-2RK1 phases, isomach contours, $M_\infty = 0.8$, $\alpha = 10°$, Re = 500. Contours interval 0.25000E-01 in all panels. a. 800 nodes (Min./Max. Mach 0., 1.3009); b. 3114 nodes (Min./Max. Mach 0., 1.2843); c. 12284 nodes (Min./Max. Mach 0., 1.2028).

Figure 10: Navier-Stokes FMDeCV-2RK1 solution, $M_\infty = 0.8$, $\alpha = 10°$, Re = 500. a. Isomach contours (contours interval 0.25000E-01; Min./Max. Mach 0., 1.1421); b. CL and CD behaviors versus the number of finest grid iterations.



Figure 11: Convergences 
CONCLUSION 


We want to point out that the use of the FMG2 or FMDeCV strategy allowed us to get 2nd-order
accurate solutions in most cases with a limited number of operations (O(N) complexity). FMDeCV-2RK1
is better suited to smooth problems and costs half as much as FMG2. Furthermore, we had
to use an entropy correction technique [16] to get the above results, and occasionally a $10^{-12}$ residual
decrease required increasing the value of this correction to prevent the residual from stalling or to
improve robustness on the finest grid (this slightly increased the number of cycles in each phase
without changing the solution greatly). As depicted in Fig.11, these types of computations do not
induce any stall during the convergence. The two main difficulties, non-embedded meshes and the
requirement of 2nd-order accuracy, were remedied, respectively, by using a coarsening algorithm based
on a Voronoi technique, and basic iteration techniques that were sufficient smoothers. The difficulty
encountered in using an FMG strategy, with our meshes, increased as the Reynolds number was raised.
Actually, it is obvious that our meshes are not adapted to these computations and that boundary
layer problems will need the production of stretched meshes and different specialized smoothers, as
suggested in [14].
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SUMMARY 


In the present study, a scheme capable of solving complex nonlinear systems
of equations quickly and robustly is presented. The Block Adaptive
Multigrid (BAM) solution method offers multigrid acceleration and adaptive
grid refinement based on the prediction of the solution error. The proposed
solution method was used with an implicit upwind Euler solver for the solution
of complex transonic flows around airfoils. Very fast results were obtained
(18-fold acceleration of the solution) using one fourth of the volumes of a
global grid with the same solution accuracy for two test cases.


INTRODUCTION 


Although multigrid methods were introduced as grid adaptation techniques,
they have been established only as fast and efficient solvers for large scale
computational problems. To date only a few adaptive multigrid schemes have
been presented, e.g. the Multilevel Adaptive Technique MLAT (ref. 1), the Fast
Adaptive Composite grid method FAC (ref. 2) and others (ref. 3), but the domain
of applications has been mainly restricted to the solution of elliptic type
equations. Regarding the development of adaptive schemes for hyperbolic
systems of equations, only a few attempts have been made to take advantage of the
favourable multigrid concept for the acceleration of the solution. On the
other hand, great advantages have been pointed out for the use of
truncation error prediction as a reliable error sensor for grid adaptation,
though few studies exhibited numerical proofs (ref. 4,5).

In the present study, a dynamically grid-adaptive method, namely the Block
Adaptive Multigrid (BAM) method, is presented, incorporating a reliable device
for the prediction of the error and a composite multigrid solver. The method
is based on a Full Multigrid scheme: starting from an acceptable coarse mesh,
the solution creates finer grid patches with block grid refinement. The
refined regions are grouped into rectangular blocks defining a composite
structure which is totally handled by the multigrid method. In this way an
adapted non-uniform domain is decomposed into regular subdomains solved with
common solvers. Further, the communication between blocks and the

1 This work was partially supported by CEC/ Brite-Euram contract AERO-0018C. 
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parallelization of the BAM method are based on domain decomposition ideas. For 
the integration and relaxation of the time marching Euler equations, an 
unfactored implicit upwind finite volume scheme is employed. The proposed 
method is tested on two complicated transonic inviscid cases for two
different airfoils. Using the proposed method, stable and accurate results are
obtained in a small number of work units (18-fold acceleration with 4.5 times
fewer volumes).


FINITE VOLUME DISCRETIZATION 


The general inviscid flow is described by the Euler equations, which can be
solved using the very popular time-marching conservative formulation. For the
two-dimensional case, the conservation laws are written in body-fitted
coordinates $(\xi, \eta)$:

$$\frac{\partial \hat{U}}{\partial t} + \frac{\partial \hat{E}}{\partial \xi} + \frac{\partial \hat{F}}{\partial \eta} = 0 \qquad (2.1)$$

The steady state solution is found when the time derivative of the solution
vector becomes negligible. The solution vector and the fluxes normal to the
$\xi$ = const. and $\eta$ = const. faces are given, respectively, by:

$$\hat{U} = J \cdot U, \qquad \hat{E} = J \cdot (E\,\xi_x + F\,\xi_y), \qquad \hat{F} = J \cdot (E\,\eta_x + F\,\eta_y) \qquad (2.2)$$

In the Cartesian coordinate system the corresponding solution vector and the
inviscid fluxes are given by:

$$U = \begin{bmatrix} \rho \\ \rho u \\ \rho v \\ e \end{bmatrix}, \qquad E = \begin{bmatrix} \rho u \\ \rho u^2 + p \\ \rho u v \\ (e+p)u \end{bmatrix}, \qquad F = \begin{bmatrix} \rho v \\ \rho u v \\ \rho v^2 + p \\ (e+p)v \end{bmatrix} \qquad (2.3)$$

In equation (2.3) $e$ is the total energy ($e = \frac{p}{\gamma - 1} + \frac{1}{2}\rho\,[u^2 + v^2]$), $p$ and $\rho$ are
the pressure and the density, respectively, and $J$ is the Jacobian of the inverse
mapping.
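
As an illustration of the Cartesian solution vector and inviscid fluxes described above, the following minimal Python sketch evaluates them from primitive variables. It assumes the usual perfect-gas relation $e = p/(\gamma-1) + \frac{1}{2}\rho(u^2+v^2)$; the value $\gamma = 1.4$ and the function names `total_energy` and `cartesian_fluxes` are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

GAMMA = 1.4  # assumed ratio of specific heats (perfect gas); not stated in the text


def total_energy(rho, u, v, p, gamma=GAMMA):
    """Total energy per unit volume: e = p/(gamma-1) + 0.5*rho*(u^2 + v^2)."""
    return p / (gamma - 1.0) + 0.5 * rho * (u * u + v * v)


def cartesian_fluxes(rho, u, v, p, gamma=GAMMA):
    """Return the solution vector U and the inviscid fluxes E, F (Cartesian form)."""
    e = total_energy(rho, u, v, p, gamma)
    U = np.array([rho, rho * u, rho * v, e])
    E = np.array([rho * u, rho * u * u + p, rho * u * v, (e + p) * u])
    F = np.array([rho * v, rho * u * v, rho * v * v + p, (e + p) * v])
    return U, E, F
```

The transformed quantities of the body-fitted formulation would then follow by multiplying with the Jacobian and the metric terms, which are grid dependent and therefore omitted here.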

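For the discretisation of equation (2.1) a cell-centered finite volume
method is used. For the pseudo-time evolution a Newton linearization scheme is
adopted which, being an implicit scheme, allows high CFL numbers (100-200) to
be used with local time stepping (a different $\Delta t$ for each volume):

$$\frac{\Delta \hat{U}}{\Delta t} + (\hat{A}^n \Delta \hat{U})_\xi + (\hat{B}^n \Delta \hat{U})_\eta = -(\hat{E}^n_\xi + \hat{F}^n_\eta) \qquad (2.4)$$

Or else:

$$L^n \cdot \Delta \hat{U} = \left( \frac{I}{\Delta t} + \frac{\partial}{\partial \xi} \hat{A}^n + \frac{\partial}{\partial \eta} \hat{B}^n \right) \cdot \Delta \hat{U} = -\mathrm{Res}^n(\hat{U}^n) \qquad (2.5)$$

where $\Delta \hat{U}$ is the correction of the solution vector, and $\hat{A}$ and $\hat{B}$ are the Jacobians
of the fluxes $\hat{E}$ and $\hat{F}$, respectively, for the time level n:

$$\hat{U}^{n+1} = \hat{U}^n + \Delta \hat{U}, \qquad \hat{A}^n = \left( \frac{\partial \hat{E}}{\partial \hat{U}} \right)^n, \qquad \hat{B}^n = \left( \frac{\partial \hat{F}}{\partial \hat{U}} \right)^n \qquad (2.6)$$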

Upwind differencing of the flux vectors is used to achieve a diagonally
dominant system of equations. Thus, using a symmetric collective point
Gauss-Seidel relaxation scheme, very good smoothing properties are attained for
the multigrid calculations. For the flux calculations a linear locally
one-dimensional Riemann solver (Godunov-type) is employed; thus, the homogeneous
property of the Euler fluxes [7] is guaranteed. The mean values of the
conservative variables at both sides of the faces are used as input variables
for the Riemann solver. For the calculation of the fluxes E and F the
conservative variables are extrapolated using an upwind characteristic
variable interpolation method (MUSCL-type). The interpolation scheme uses two
volumes from both sides of each face. The accuracy of the scheme rises up to
third order depending on the sign of the local eigenvalues of the Jacobians A
and B. The local accuracy of the finite volume method is sensor-controlled so
that the monotonic behaviour of the solution is guaranteed. Boundary conditions
are required for both sides of eq. (2.4): for the RHS of eq. (2.4),
characteristic boundary conditions are extracted from the Riemann solutions at
the wall and at free surfaces; for the LHS of eq. (2.4), simple boundary
conditions are prescribed on $\Delta U$ in phantom shells.
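
The symmetric point Gauss-Seidel relaxation mentioned above can be sketched for a generic linear system as follows. This is only a scalar illustration under stated assumptions: the function name `symmetric_gauss_seidel` is hypothetical, the matrix is a dense NumPy array, and each unknown is a single scalar, whereas a collective variant would update all conservation variables of one volume together.

```python
import numpy as np


def symmetric_gauss_seidel(A, b, x, sweeps=1):
    """Symmetric point Gauss-Seidel for A x = b; x is updated in place.

    Each symmetric sweep is a forward pass over the unknowns followed by a
    backward pass. Convergence is guaranteed, e.g., for strictly diagonally
    dominant A.
    """
    n = len(b)
    for _ in range(sweeps):
        for order in (range(n), range(n - 1, -1, -1)):  # forward, then backward
            for i in order:
                # Sum of off-diagonal contributions using the latest values.
                sigma = A[i] @ x - A[i, i] * x[i]
                x[i] = (b[i] - sigma) / A[i, i]
    return x
```

In a multigrid context only a few such sweeps would be applied per level, since the purpose is smoothing of the error rather than full convergence.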


THE BLOCK ADAPTIVE MULTIGRID METHOD 


The Block Adaptive Multigrid (BAM) method is composed of three main 
parts: the fast nonlinear multigrid solver, the truncation error approximation 
for the prediction of the solution error and the block composite grid solver. 
Additionally an efficient solution strategy is required in order to achieve 
fast and robust accurate solutions. 


Multigrid Implementation 


Concerning the multigrid method, the Full Approximation Scheme (FAS) is
employed as it is better than the Correction Scheme (CS) for Newton
linearisation schemes. FAS has the major advantage that it operates with the
same solution vector as the initial algorithm, so it is best suited for the
solution of composite grid structures. The efficiency and the performance of
the multigrid implementation are maintained by adopting the "alternate point of
view" of the FAS (ref. 1). Following this approach the finer grid levels are
considered as devices to increase the spatial accuracy of the solution whereas
the coarser grid levels are devices to accelerate the solution. The
formulation of the solution is independent of the grid level (coarse or fine)
and the type of grid (local or global) by simply adding to the RHS of eq. (2.5)
the appropriate fine-to-coarse defect correction ($\tau$). For the enumeration of
the multigrid levels the classical mode is adopted. Thus, the current grid
level is denoted by n, its next finer level by n+1 and its corresponding
finest grid level by m ($m \ge n \ge 1$; 1 is the coarsest grid level of the
domain). The solution formulation for the current grid level n is given as:

$$L_n \cdot \Delta U_n = -\mathrm{Res}_n(U_n) + \tau_n^{n+1} \qquad (3.1)$$

considering that the fine-to-coarse defect correction is:

$$\tau_n^{n+1} = \Sigma_n^{n+1} \left[ L_{n+1} \cdot \Delta U_{n+1} \right] - L_n \cdot (I_n^{n+1} \Delta U_{n+1}) \quad \text{and} \quad \tau_m^{m+1} = 0 \qquad (3.2)$$

Because one multigrid cycle equals one time step, the time scale is not
considered.
Because of the applied cell-centered finite volume scheme, cellwise
coarsening is adopted; each coarse volume is constructed from four
finer ones. The cellwise coarsening maintains the outer edges of the volumes
(straightforward implementation of conservation laws) but the coarse grid centers are
not a subset of the fine ones. Therefore, two different restriction operators
are required. The restriction operator ($I$) for the physical variables is the
simple average over the four fine volumes:

$$\Delta U_n = I_n^{n+1} \Delta U_{n+1} = \frac{1}{4} \sum_{i=1}^{4} \Delta U_{n+1,i} \qquad (3.3)$$

In contrast, the restriction operator ($\Sigma$) for the generalized residuals, Res
and $\tau$, is the summation of the residuals of the corresponding fine volumes. The
fluxes of the inner common fine grid faces cancel, thus flux
conservation at the coarser grid levels is maintained:

$$\mathrm{Res}_n = \Sigma_n^{n+1} \mathrm{Res}_{n+1} = \sum_{i=1}^{4} \mathrm{Res}_{n+1,i} \qquad (3.4)$$
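
The two restriction operators described above (a simple average for the physical variables, a summation for the residuals) can be sketched for a structured grid as follows. This is a minimal illustration assuming that each field is stored as a two-dimensional NumPy array with even dimensions and that each coarse volume covers a 2x2 group of fine volumes; the function names are hypothetical.

```python
import numpy as np


def restrict_average(fine):
    """Operator I: average each 2x2 group of fine volumes (physical variables)."""
    ni, nj = fine.shape
    return fine.reshape(ni // 2, 2, nj // 2, 2).mean(axis=(1, 3))


def restrict_sum(fine):
    """Operator Sigma: sum each 2x2 group of fine volumes (residuals, tau)."""
    ni, nj = fine.shape
    return fine.reshape(ni // 2, 2, nj // 2, 2).sum(axis=(1, 3))
```

For any field the summed restriction equals four times the averaged one, which reflects the difference between an intensive quantity (the solution) and an integrated one (the residual of a volume).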


For the reverse direction of the multigrid cycle (coarse-to-fine direction)
neither Euler solutions nor relaxation sweeps are required. Therefore, $\Delta U$
variables are stored for all grid levels and only these are prolongated in
the coarse-to-fine direction using the standard FAS prolongation formulation:

$$\Delta U_{n+1}^{\mathrm{new}} = \Delta U_{n+1} + \Pi_{n+1}^{n} (\Delta U_n - I_n^{n+1} \Delta U_{n+1}) \qquad (3.5)$$

For the prolongation operator ($\Pi$) simple injection is adopted, i.e.:

$$U_{n+1} = \Pi_{n+1}^{n} U_n = U_n \qquad (3.6)$$


Due to the composite grid structure, complicated interpolation schemes would 
have increased considerably the programming complexity with minor advantages. 
For the present multigrid implementation the V-cycle is applied with the 
improvement of increasing the number of relaxation sweeps as coarser levels 
are processed. 
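
The coarse-to-fine step described above, i.e. the FAS correction with prolongation by simple injection, might be sketched as follows. The sketch assumes two-dimensional NumPy arrays with a refinement ratio of 2 in each direction and reuses an averaging restriction; all function names are assumptions made for this example.

```python
import numpy as np


def restrict_average(fine):
    """Operator I: average each 2x2 group of fine volumes."""
    ni, nj = fine.shape
    return fine.reshape(ni // 2, 2, nj // 2, 2).mean(axis=(1, 3))


def prolong_inject(coarse):
    """Operator Pi: copy each coarse value to its four fine volumes."""
    return np.repeat(np.repeat(coarse, 2, axis=0), 2, axis=1)


def fas_correct(dU_fine, dU_coarse):
    """FAS prolongation: add the injected change of the coarse-grid correction.

    Only the difference between the coarse-grid result and the restricted
    fine-grid values is transferred, so an unchanged coarse grid leaves the
    fine grid untouched.
    """
    return dU_fine + prolong_inject(dU_coarse - restrict_average(dU_fine))
```

Because only the coarse-grid change is injected, the fine-grid detail already present in the correction is preserved by this step.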


Solution Error Approximation 


To determine the regions of the computational domain where an
increase of the accuracy is required, a reliable error indicator is needed.
Two approaches essentially exist: the first is based on physical
information about the problem, i.e. solution gradients, while the second,
numerically motivated, is the evaluation of the discretization error. The
former approach may lead the refinement process to reduce errors that
have no influence on the global solution. In contrast, the evaluation of the
truncation error indicates errors which can be reduced by the
refinement procedure. On the basis of Richardson extrapolation an
estimate of the truncation error is formulated as:

$$T_n = Q(h) \cdot u \qquad (3.6)$$

whereas for a uniform Cartesian mesh the following holds:

$$T_n = c\,h^p \qquad (3.7)$$

In equations (3.6), (3.7) $Q$ is the differential operator, $u$ is the physically
correct solution, $h$ is the mesh size of the finest grid level, $p$ is the local
order of accuracy and $c$ is an unknown grid-independent factor. Guided by the
physical interpretation of the truncation error and the Richardson
extrapolation concept, the difference of residuals between the two finest grid
levels is used for the truncation error calculation. This error sensor
provides a reliable local estimation of the solution error. Although the
fine-to-coarse defect correction of eq. (3.2) is directly provided by the
multigrid solution, it proved to be an unreliable error indicator.
Fortunately, the use of the Res(U) operator instead of the $L \cdot \Delta U$ operator gives
very good results. The explanation is that, due to the Newton linearization
scheme, the Res(U) differential operator is insensitive to relaxation errors,
maintaining the accuracy of the solution. Therefore, the solution error
evaluation for the grid level m-1 (m is the current local finest grid level)
is given by:

$$T_{m-1}^{m} = \Sigma_{m-1}^{m} T_m - T_{m-1} = \Sigma_{m-1}^{m} \mathrm{Res}_m(U_m) - \mathrm{Res}_{m-1}(I_{m-1}^{m} U_m) \qquad (3.8)$$

For a totally converged solution equation (3.8) reduces to:

$$T_{m-1}^{m} = -\mathrm{Res}_{m-1}(I_{m-1}^{m} U_m) \qquad (3.9)$$


As this truncation error sensor is a vector, a norm reducing it to a single
value is required. In the present study the Euclidean norm has been adopted
because it shows a distribution similar to the pressure error of the solution.
The proposed error sensor requires additional work of only one fourth of a
simple flux calculation and it does not demand a totally converged solution
($\mathrm{Res}_m(U_m) \approx 0$), although it converges from the early time steps to the steady
state.
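
A possible way to assemble the error sensor described above is sketched below: the summed fine-grid residuals are compared with the coarse-grid residuals of the restricted solution, and the resulting vector is reduced to one value per coarse volume with the Euclidean norm. The array layout (component index first, then the two grid indices) and the function names are assumptions for illustration; the residual arrays themselves would come from the flow solver.

```python
import numpy as np


def restrict_sum(fine):
    """Operator Sigma: sum over each 2x2 group of fine volumes (last two axes)."""
    *lead, ni, nj = fine.shape
    return fine.reshape(*lead, ni // 2, 2, nj // 2, 2).sum(axis=(-3, -1))


def truncation_error_sensor(res_fine, res_coarse_of_restricted):
    """Difference of residuals between the two finest levels, then Euclidean norm.

    res_fine                 : Res_m(U_m), shape (ncomp, ni, nj)
    res_coarse_of_restricted : Res_{m-1}(I U_m), shape (ncomp, ni/2, nj/2)
    Returns one non-negative value per coarse volume, shape (ni/2, nj/2).
    """
    T = restrict_sum(res_fine) - res_coarse_of_restricted
    return np.sqrt((T * T).sum(axis=0))
```

For a converged fine-grid solution the first argument vanishes and the sensor reduces to the norm of the coarse-grid residual of the restricted solution.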


The prediction of the solution error for the new finer grid level m+1 can
be computed only if the assumption of a uniform Cartesian mesh is adopted
($h_{m+1} = h_m/2$). Thus, following equation (3.7), we get:

$$T_{m}^{m+1} = 2^{-p}\, T_{m-1}^{m} \qquad (3.10)$$

For the determination of the order of accuracy (p) of eq. (3.10), which
varies from one to two depending on the flow features, the sensing functions
that are used to control monotonicity in the integration routine are also used
to calculate the local accuracy of the scheme. Unfortunately, equation (3.10)
holds only for Cartesian grids; thus, when this equation is used on arbitrary
grids, errors are expected in the predicted error levels.
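
The prediction step can be illustrated with a short sketch: halving the mesh size scales a truncation error of local order p by the factor 2 to the power -p. The function name `predict_error` is hypothetical, and the local order p is simply passed in, since its evaluation from the sensing functions is solver specific.

```python
import numpy as np


def predict_error(T_current, p):
    """Predicted error on the next finer level for mesh halving: T * 2**(-p).

    T_current and p may be scalars or arrays of the same shape, so that a
    locally varying order of accuracy can be used.
    """
    p = np.asarray(p, dtype=float)
    return np.asarray(T_current, dtype=float) * 2.0 ** (-p)
```

For example, a computed error level of 8 is predicted to drop to 4 where the scheme is locally first order and to 2 where it is locally second order.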


Composite Grid Structure and Solution Procedure 


For the optimal approximation of the most accurate solution within the 
minimum amount of work, several grid adaptive techniques and structures have 
been developed. The composite grid structure has many advantages by enabling 
the solution of a globally non uniform grid using a union of locally uniform 
subgrids (blocks) (ref. 2). Subgrid uniformity is essential to assure 
multigrid efficiency using simple solution routines (similar to the single 
grid solver). Moreover, significant simplifications to the data structure and 
to the interface manipulation are attained when the blocks are restricted to 
rectangles in the computational domain and the grid refinement ratio is 2 for 
each refined direction (ref. 5). 

In a composite multigrid method the major problem is the requirement of
special manipulation at the artificial boundaries to maintain the accuracy of
the global solution. To suppress errors inconsistent with the solution method
at the artificial boundaries (two fine volumes share the same edge with a
coarse one) certain special boundary conditions must hold, namely: accuracy
preservation of the integration scheme, flux conservation and flux-splitting
compatibility with respect to the global grid solver. To preserve the accuracy
of the solution across an artificial boundary, the fluxes of the domain should
be calculated at the block's finest grid level. For these calculations the
standard integration routine has been employed. The basic idea is to properly
construct false fine volumes at the coarse grid boundaries using all the
available information near the interfaces. The modified scheme at the
intergrid boundaries is depicted in fig. 1 and fig. 2 for the one- and
two-dimensional analogs, respectively. Following these analogs the flux
calculations should be started from Block 1 including the boundary interfaces.
For the evaluation of the virtual volumes F3 and F4 (fig. 1) the coarse volumes
C1, C2 and the fine ones F1, F2 are considered, adopting the same MUSCL-type
extrapolation scheme used by the integration routine. In the two-dimensional
analog, the same couple of coarse volumes is used with the corresponding fine
volumes (fig. 2). In this way, the order of accuracy of the global solution
scheme is maintained. After calculating the fluxes of the finest subgrid
(Block 1 in fig. 1), the boundary fluxes of the neighboring coarser subgrids
(Block 2) can be calculated explicitly using conservation of fluxes. According
to the multigrid restriction operator for the residuals, flux conservation
across the artificial boundaries is achieved by addressing the summation of
the fluxes of the fine volume faces to their adjacent coarse volume face (fig. 

With respect to the relaxation procedure, the modifications to the flux
vector splitting and the relaxation scheme at the subgrid interfaces are
totally handled by the multigrid algorithm. Adopting the "horizontal"
communication mode among subgrids, the fine subgrids (i.e. block 1) are
relaxed at the first grid level (fig. 1). At the next multigrid level the
coarse subgrids (i.e. block 2) are relaxed together with the restricted fine
ones, while the block structure of the composite grid is preserved throughout
the multigrid cycle.
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Due to the block structure, the composite grid has the advantage of a 
straightforward implementation of vectorization and parallelization. This can
be achieved when subgrids are considered totally independent from each other,
introducing ideas from domain decomposition theory. In the
"vertical" communication mode (ref. 11), each subgrid is solved and relaxed
independently for the complete multigrid cycle whereas data exchange among
subgrids is permitted only at the start and/or the end of each cycle. The
computational domain decomposition (different from the composite block
structure) (ref. 8) should be performed according to load balancing and
vectorization criteria, taking into consideration the hardware configuration.
The "vertical" communication mode is preferred to the "horizontal" mode when
we deal with parallel processing and is used even though the latter mode has
better convergence rates. Using the "vertical" mode the idle and communication
time among processors can be considerably decreased.

With respect to the dynamically adaptive multigrid strategy, a modified
Full Multigrid scheme is applied, starting with a global coarse grid of
acceptable grid resolution. After convergence (or after a fixed amount of
work) on the current grid, the truncation error is calculated and the solution
error is predicted. In regions where the predicted error is above a
threshold the corresponding volumes are flagged and grouped into rectangular
blocks. Afterwards, the domain is decomposed into the appropriate blocks, where
only those which contain flagged volumes are refined to the next grid
level by injecting the coarse grid solution into the refined grid. The refinement
procedure continues until the entire computational domain has local truncation
errors below a given threshold. Clearly this strategy has the benefit of a
continuous iterative procedure without wasting CPU time on calculations that
will not be used after the mesh refinement. Taking advantage of the most
accurate available solution, the proposed scheme converges straight to the most
efficient solution.
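
The flagging and grouping step of this strategy could look roughly like the sketch below. It is not the authors' grouping algorithm, which is not detailed in the text: as a simple stand-in, volumes whose predicted error exceeds the threshold are flagged, edge-connected flagged volumes are collected, and one bounding rectangle per group is returned. The function name `bounding_blocks` and the tuple layout are assumptions.

```python
from collections import deque

import numpy as np


def bounding_blocks(flags):
    """Group flagged volumes (4-connected) into bounding rectangles.

    Returns a list of (i0, i1, j0, j1) tuples with inclusive index bounds,
    one per connected group, in row-major order of discovery.
    """
    flags = np.asarray(flags, dtype=bool)
    seen = np.zeros_like(flags)
    ni, nj = flags.shape
    blocks = []
    for i in range(ni):
        for j in range(nj):
            if not flags[i, j] or seen[i, j]:
                continue
            i0 = i1 = i
            j0 = j1 = j
            queue = deque([(i, j)])
            seen[i, j] = True
            while queue:  # breadth-first search over the connected group
                a, b = queue.popleft()
                i0, i1 = min(i0, a), max(i1, a)
                j0, j1 = min(j0, b), max(j1, b)
                for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    c, d = a + da, b + db
                    if 0 <= c < ni and 0 <= d < nj and flags[c, d] and not seen[c, d]:
                        seen[c, d] = True
                        queue.append((c, d))
            blocks.append((i0, i1, j0, j1))
    return blocks
```

A practical implementation would additionally merge overlapping rectangles and split the remaining domain into unrefined blocks; these steps are omitted from the sketch.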


RESULTS AND DISCUSSION 


In order to verify the accuracy and to validate the efficiency of the
proposed method, two transonic inviscid test cases were investigated. The first
case is a NACA-0012 airfoil at Mach 0.80 and an angle of 1.25 degrees, while the
second case is a RAE-2822 airfoil at Mach 0.73 and 2.79 degrees. A Work Unit
(W.U.) is defined as the CPU time required for a global finest grid relaxation
sweep in lexicographic order, while for a single grid run one time step
costs four work units.

For both test cases the grid adaptation procedure described above was
applied. The starting point is two global multigrid levels with 64x14 volumes at
the finest grid level. The user supplies only the maximum number of
additional grid levels, which was the same for both test cases: two refinement
grid levels. The truncation error threshold was defined explicitly, targeting
a three-fold reduction of the initial error levels. For the convergence
criterion the Euclidean norm of the correction vector was employed.


For the first test case (NACA-0012) a comparison between the computed and
the predicted truncation error contours is given in figures 3 and 4, respectively.
Some differences, not crucial, are due to the incorrect evaluation of the
local accuracy of the solution (p). For the same grid level the actual
pressure error contours and the entropy contours are included in figures 5 and 6.
A comparison between the truncation errors and the pressure errors
shows their similar distributions. Following the proposed method, very accurate
results were obtained in comparison to the global solutions, as depicted in
fig. 7. In fig. 7 the Mach distributions along the airfoil are depicted for two
global grids (256x56 and 128x28) and one composite grid sharing both grid
levels. A comparison of these Mach distributions shows that in regions
with the same mesh size the solutions of the global and the local refinement
coincide. Additionally, in fig. 8 the Mach contours together with the composite
grid are given. In fig. 9 the efficiency of the BAM method with respect to the
single grid and the global multigrid schemes is clearly indicated. A 19-fold
and a 4-fold acceleration were achieved with respect to the single grid and
the multigrid cases, respectively. The final grid-adaptive solution requires
4.5 times fewer volumes (from 14336 to 3200) for practically the same accuracy
as a globally refined grid (0.35% discrepancy of the computed Cl).

Similar efficiency was achieved for the second test case. The domain
decomposition into subgrids took place after 50 time steps spent on coarser
grid levels. Using the same truncation error threshold as in the previous test
case, a 17-fold acceleration was achieved with respect to the single grid and
a 4.7-fold reduction of the volumes (from 14336 to 3056) for practically the
same accuracy (Cl discrepancy is 0.2%). The convergence histories of the
error reduction and the lift coefficient are shown in fig. 10, while in fig. 11
the final composite grid together with the isomach contours is depicted. It
is important that throughout the solution process the multigrid convergence
rates were maintained while the overhead for the interface computations was
negligible, i.e. only 2% for a nine-block structure with respect to an
equivalent global grid.
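
The quoted volume reductions can be checked with a few lines of arithmetic. The sketch below only uses numbers stated in the text (a 256x56 global fine grid and composite grids of 3200 and 3056 volumes); the helper name `volume_reduction` is hypothetical.

```python
def volume_reduction(global_volumes, composite_volumes):
    """Ratio of global fine-grid volumes to composite-grid volumes."""
    return global_volumes / composite_volumes


# Finest global grid: 64x14 refined twice with ratio 2 gives 256x56 volumes.
GLOBAL_FINE = 256 * 56

if __name__ == "__main__":
    for case, composite in (("CASE 1", 3200), ("CASE 2", 3056)):
        print(case, round(volume_reduction(GLOBAL_FINE, composite), 2))
```

The ratios evaluate to about 4.48 and 4.69, in line with the rounded factors of 4.5 and 4.7 given for the two test cases.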

As no parallel machine was available, a simulation of the
parallelization of the procedure has been attempted in order to evaluate the
performance of the two different communication modes for the BAM method. To
achieve this, the data exchange was properly adjusted among subgrids according
to each communication mode. For the "horizontal" mode (semi-parallel) and the
"vertical" mode (parallel) the convergence histories are shown in fig. 12. A
comparison between the parallel modes and the sequential BAM method shows only
a small reduction of the convergence rates for the parallel ones. What is
further required for a parallel implementation is a computational domain
decomposition of the composite structure based on specific load balancing
criteria in order to reduce the idle time among processors.


CONCLUSIONS 


The great advantages of the Block Adaptive Multigrid (BAM) method were
exhibited. The incorporation of numerous efficient schemes into the BAM method
makes the ultimate target of solving complex problems in just a few work units
feasible. At the same time the robustness, simplicity and accuracy of the single
grid code were maintained in the new method. The extension to viscous and
hypersonic three-dimensional problems is straightforward, though semicoarsening
multigrid can also be included. On the other hand, in order to
improve the adaptation capabilities, a moving grid point scheme should also be
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considered as the grid alignment towards certain flow features is essential in 
some problems in combination with the present grid refinement procedure. 

Although the basic features of the BAM method have been determined and
verified, a few other issues remain to be settled. The first is the
construction of a data structure which will handle the block
structure of the composite grid more efficiently. The second is to approximate more precisely
the local order of accuracy of the solution in such a way that the prediction
of the solution error would be more accurate. Additionally, the application
of the BAM method to other solution algorithms and equations is foreseen, as
the BAM method was designed within the general concept of the finite volume
method.
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BLOCK 1 BLOCK 2 

Figure 1. One dimensional analog of the multigrid coarsening and the
fictitious volumes F3, F4 for the intergrid flux calculations.
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Figure 2. Two dimensional analog of the special variable interpolation procedure at the 
intergrid boundaries.
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Figure 3. Computed truncation error contours for the finer grid level, CASE 1. 
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Figure 4. Predicted truncation error contours for the finer grid level, CASE 1. 
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Figure 7. Mach distribution along the airfoil for a global fine, 
a global coarser and an adaptive composite grid (CASE 1). 
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Figure 8. The adaptive composite grid and the corresponding Mach contours (CASE 1). 
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Figure 9* Convergence histories of the lift coefficient and the logarithmic Euclidean 
error reduction for the single grid, global multigrid and B.A.M. method (CASE 1).
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Figure 10. Convergence histories of the lift coefficient and the logarithmic Euclidean 
error reduction for the single grid, global multigrid and B.A.M. method (CASE 2). 
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Figure 12. Convergence histories of the lift coefficient and the logarithmic Euclidean error 
reduction of the B.A.M. method (CASE 2) for sequential processing, “parallel” horizontal 
and “parallel” vertical schemes. 
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ABSTRACT __ 

Several multigrid schemes are considered for the numerical computation of viscous
hypersonic flows. For each scheme, the basic solution algorithm employs upwind spatial
discretization with explicit multistage time stepping. Two-level versions of the various
multigrid algorithms are applied to the two-dimensional advection equation, and Fourier
analysis is used to determine their damping properties. The capabilities of the multigrid
methods are assessed by solving two different hypersonic flow problems. Some new
multigrid schemes, based on semicoarsening strategies, are shown to be quite effective in
relieving the stiffness caused by the high-aspect-ratio cells required to resolve high Reynolds
number flows. These schemes exhibit good convergence rates for Reynolds numbers up to
$200 \times 10^6$.


INTRODUCTION 

In the past several years, multigrid has been used to accelerate the convergence of 
Navier-Stokes computations for a variety of flow problems at both subsonic and transonic 
speeds (refs. 1 and 2). More recently, multigrid methods with either central or upwind 
differencing have been applied to viscous hypersonic flows to achieve convergence rates 
that approach those obtained at lower Mach numbers and moderate Reynolds numbers 
(Re < $10^7$). However, at the higher Re values experienced by high-speed flight vehicles,
a dramatic slowdown occurs in the convergence rate. One reason for this slowdown is the 
deterioration in the high-frequency damping of the multigrid driving scheme caused by the 
very high-aspect-ratio cells that occur in the computational mesh in order to resolve the 
thin boundary layers. 

The present paper describes an effort to understand and improve the use of multigrid 
schemes for the computation of viscous hypersonic flows. First, various two-level multigrid 
schemes both with and without semicoarsening are introduced. Then we use a Fourier 
analysis of the schemes, applied to the two-dimensional convection equation, to reveal the 
behavior of their components. For each multigrid approach, the solver uses an upwind
discretization combined with an explicit multistage scheme. We next consider the numerical
solution of the Navier-Stokes equations for hypersonic flows. The basic elements of the 
flow solver for these equations are summarized. Some details concerning the application of 
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the time-stepping scheme to fine- and coarse-grid problems axe presented. The extension 
of the two-level schemes to multilevel ones is then discussed. Elements of multigrid that 
are of particular importance for high-speed flow computations are given. In the results 
section, we consider two different hypersonic flow problems to assess the capabilities of the 
multigrid schemes. Semicoarsening is shown to be quite effective in relaxing the stiffness 
that arises from the resolution of thin boundary layers. 

MULTIGRID METHOD AND STRATEGIES 

The multigrid approach is based on the full approximation scheme of Brandt (ref. 3). 
The grid transfer operators are those considered by Jameson (ref. 4). Coarser meshes 
axe obtained by eliminating alternate mesh points in each coordinate direction. Both the 
solution and the residuals axe restricted from fine to coarse meshes. A forcing function 
is constructed so that the solution on a coarse mesh is driven by the residuals collected 
on the next finer mesh. The corrections obtained on the coarse mesh axe interpolated 
back to the fine mesh. The multigrid schemes investigated within the present work are 
displayed in Figure 1. Figure 1(a) shows a two-level scheme with full coarsening. Re- 
striction of the solution from the fine mesh (m,n) to the coarse mesh (m/2,n/2) is done 
by injection, whereas full weighting is used for the restriction of the residuals. Prolonga- 
tion of the corrections is done by bilinear interpolation. Figure 1(b) shows a scheme with 
semicoarsening in the different coordinate directions. Again, injection and full weighting 
are used in the restriction process. The corrections obtained on the coarse meshes axe 
averaged before they are added to the current fine mesh solution which is indicated by the 
numbers at the “up” arrows. Because of this averaging, half of the individual corrections 
on the coarse meshes axe lost. We, therefore, anticipate that the scheme in Figure 1(a) 
should be computationally more efficient, provided that enough high-frequency damping 
can be obtained with the smoothing scheme of the fine mesh. In order to overcome this 
deficiency of the semicoarsening scheme, two more variants are considered. For the scheme 
of Figure 1(c), the solutions on the coarse meshes are computed sequentially. Hence, the 
corrections obtained on the (m/2,n) mesh can be used to update the (m,n/2) mesh before 
time stepping (as indicated by the horizontal arrow). The sequential update of the second 
coarse mesh allows the full corrections to be passed up to the fine mesh. Note that this 
multigrid variant is not compatible with the idea of parallel computations. An interesting 
compromise between the schemes of Figures 1(b) and 1(c) was suggested by Van Rosendale 
based on the work of ref. 5 (Figure 1(d)). Here, only the corrections common to both 
of the coarse meshes, (m/2,n) and (m,n/2), are averaged, whereas the corrections to the 
modes that live either on (m/2,n) or on (m,n/2) are passed to the fine mesh in full. This 
scheme does allow parallel computations for the coarse meshes. 
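
The grid transfer operators named above can be sketched as follows. This is a minimal illustration on a small vertex-based mesh stored as a list of rows, not the authors' implementation; the function names and the treatment of boundary points (plain injection) are assumptions made for the example.

```python
def inject(fine):
    """Restrict by injection: keep every second point in each direction."""
    return [row[::2] for row in fine[::2]]


def full_weighting(fine):
    """Restrict with the 9-point stencil (1/16)[1 2 1; 2 4 2; 1 2 1].

    Boundary points are simply injected (an assumption of this sketch).
    """
    coarse = inject(fine)
    for I in range(1, len(coarse) - 1):
        for J in range(1, len(coarse[0]) - 1):
            i, j = 2 * I, 2 * J
            coarse[I][J] = (
                4.0 * fine[i][j]
                + 2.0 * (fine[i - 1][j] + fine[i + 1][j]
                         + fine[i][j - 1] + fine[i][j + 1])
                + fine[i - 1][j - 1] + fine[i - 1][j + 1]
                + fine[i + 1][j - 1] + fine[i + 1][j + 1]
            ) / 16.0
    return coarse


def prolong_bilinear(coarse):
    """Interpolate coarse-mesh corrections to the fine mesh bilinearly."""
    nI, nJ = len(coarse), len(coarse[0])
    fine = [[0.0] * (2 * nJ - 1) for _ in range(2 * nI - 1)]
    for i in range(2 * nI - 1):
        for j in range(2 * nJ - 1):
            I0, J0 = i // 2, j // 2
            I1 = min(I0 + i % 2, nI - 1)  # equals I0 on coincident points
            J1 = min(J0 + j % 2, nJ - 1)
            fine[i][j] = 0.25 * (coarse[I0][J0] + coarse[I1][J0]
                                 + coarse[I0][J1] + coarse[I1][J1])
    return fine


# A linear function is reproduced exactly by injection followed by
# bilinear prolongation.
fine = [[float(i + 2 * j) for j in range(5)] for i in range(5)]
print(prolong_bilinear(inject(fine)) == fine)
```

Semicoarsening in a single direction would restrict only one index, for example `[row[::2] for row in fine]`, and the simple averaging of Figure 1(b) would add half of each of the two prolonged corrections to the fine-mesh solution.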

FOURIER ANALYSIS OF THE SCALAR ADVECTION EQUATION 

A crucial factor in constructing an effective multigrid method is the selection of a 
smoothing or driving scheme. Local mode (Fourier) analysis is generally applied to evaluate 
possible smoothers on the basis of stability and high-frequency damping properties. The 
screening of schemes is often performed with a single-grid analysis. Since a stable 
single-grid scheme may not be stable for the multigrid process, an analysis of the behavior 
of a smoother with a particular multigrid strategy is needed. In addition, the choice of 
multigrid strategy can have a substantial impact on the performance of the multigrid method. 
In fact, as we will demonstrate in this paper, semicoarsening can provide significant 
improvement, relative to full coarsening, in the damping of the multigrid scheme, especially 
when a strong mesh anisotropy is present due to high-aspect-ratio cells. 

In ref. 4, Jameson models a multigrid scheme as a multilevel uniform scheme and 
analyzes the stability of this scheme when applied to the linear advection equation in one 
space dimension. With the multilevel uniform scheme, fine-grid and coarse-grid corrections 
are computed at all points of the fine grid. Then, a nonlinear filter is applied to remove the 
coarse-grid corrections at fine-grid points not contained in the coarse grid. The filtering 
produces additional errors in the form of a carrier wave with a frequency depending on 
the fine-mesh spacing. This approach does not allow for the coupling (aliasing) effects due 
to the restriction operator (fine to coarse grid transfer operator) in the multigrid method. 
However, it does offer the advantages of simplicity and application to more than two-level 
schemes. Thus, it allows the rapid comparison of multigrid algorithms. If a multigrid 
method is unstable or inefficient according to Fourier analysis of the multilevel uniform 
scheme, then it is probably not a reasonable scheme. 

In ref. 6 we consider the scalar two-dimensional advection equation and perform a 
Fourier analysis of the multilevel uniform scheme for different multigrid strategies. The 
effects of mesh-cell aspect ratio are included in the analysis. For details of the analysis, 
see ref. 6. Here, as in ref. 6, a five-stage scheme with three weighted evaluations of the 
numerical dissipation is used for a solver. The explicit stability limit of this scheme is 
extended with variable-coefficient implicit residual smoothing, which results in a Courant- 
Friedrichs-Lewy (CFL) number of 5. A two-level analysis is applied to both full coarsening 
and semicoarsening strategies. Figure 2 presents contours of the amplification factor g as 
a function of Fourier phase angles for the full coarsening and sequential semicoarsening 
strategies when the mesh-cell aspect ratio (AR) was set to 10. Even with this AR, one 
can clearly see the improved damping (reduced g) in the direction of the long side of the 
cell with sequential semicoarsening. 

SPATIAL DISCRETIZATION 

A finite-volume approach, where the flow quantities are stored at the cell vertices, is 
used for the spatial discretization of the Navier-Stokes equations. For the convective flux 
calculation, an auxiliary grid is used, which is defined by connecting the cell centers of the 
original cells (see Figure 3). The inviscid numerical flux is separated into the sum of an 
averaged term that corresponds to central differencing and a dissipative term that adapts 
the discretization stencil in accordance with local wave propagation. The dissipative flux 
function is based on the second-order-accurate upwind scheme of Yee and Harten (ref. 
7). In the case of viscous flows the entropy correction for this scheme must be carefully 
designed, as discussed in ref. 6. The physical viscous fluxes are approximated by central 
differences with a local transformation from Cartesian to curvilinear coordinates (ref. 2). 



MULTISTAGE SCHEME FOR THE FINE AND COARSE MESHES 


We have observed the need to pair spatial discretization and particular time-stepping 
schemes for the solution of the Navier-Stokes equation. The most robust choice of spatial 
discretizations found to this point is to use a second-order upwind scheme on the fine 
meshes and to set the limiter to zero everywhere on the coarse meshes. An alternative 
taken in refs. 8 and 9 is to use scalar second-difference dissipation terms on the coarse 
meshes. This approach turned out to be less robust because the second differences are 
less diffusive with respect to the acoustic modes; also, the central-difference scheme allows 
waves to travel upstream in supersonic flow. As indicated previously, a five-stage explicit 
scheme with three evaluations of dissipation is used for time advancement. Disturbances 
are most effectively expelled out of the computational domain by using local time stepping 
and implicit residual smoothing (refs. 8 and 10). The smoothing of the residuals allows the 
CFL number of the explicit scheme to be as high as 5.75, which extends the stability limit 
(CFL*) by a factor of 2.5. The time step is proportional to the ratio of the cell volume 
to the sum of the spectral radii of the inviscid flux Jacobians in the different coordinate 
directions. 

To stabilize the schemes in regions where the viscous stability limit is more restrictive 
than the inviscid limit, the coefficients of the implicit residual smoothing operator are 
locally increased, as outlined in refs. 8 and 9. At strong shocks, however, high Courant 
numbers are not appropriate. Consequently, an adaptive time step is employed. By using 
the nondimensional second difference of the pressure as a switch, the value of CFL is locally 
reduced to approximately 2 at the shock. 

MULTIGRID SCHEMES 

For the numerical solution of the Navier-Stokes equations, the two-level strategies 
presented in Figure 1 are extended to multilevel schemes, as displayed in Figure 4. The 
only differences between the two-level schemes and the multilevel schemes occur in the 
restriction process. Whenever two “down” arrows meet at a coarse mesh, averaging is 
used to obtain the restricted variable. The multilevel arrangement of the coarse meshes, 
shown in Figure 4(b), was first given by Mulder (ref. 11), who used semicoarsening to solve 
the flow alignment problem. Suitable coordinate meshes for thin boundary layers exhibit 
mostly cells with high aspect ratios in the surface-aligned direction. In this paper, other 
variants of semicoarsening, which are computationally cheaper than the semicoarsening 
schemes shown in Figure 4, are also considered for these situations. 

One may notice that the central restriction and prolongation operators discussed pre- 
viously allow for upstream propagation of disturbances in supersonic flow. Furthermore, 
the corrections given by the standard multigrid scheme near strong shocks lead to diver- 
gence of the calculation, especially when free-stream initial conditions are used. Therefore, 
the restriction operator is damped by using 

$$\hat{R}_{i,j} = \max\left(1 - \kappa^{(n)} \epsilon_{i,j},\ 0\right) R_{i,j}, \qquad (1)$$

where $R_{i,j}$ is the standard restriction operator and $\epsilon_{i,j}$ is a switch to detect strong 
shocks, and 

$$\epsilon_{i,j} = \max\left(\nu_i, \nu_{i+1}, \nu_{i-1}, \nu_j, \nu_{j+1}, \nu_{j-1}\right), \qquad (2)$$

$$\nu_i = \left| \frac{p_{i-1,j} - 2p_{i,j} + p_{i+1,j}}{p_{i-1,j} + 2p_{i,j} + p_{i+1,j}} \right|, \qquad \nu_j = \left| \frac{p_{i,j-1} - 2p_{i,j} + p_{i,j+1}}{p_{i,j-1} + 2p_{i,j} + p_{i,j+1}} \right|, \qquad (3)$$

where p denotes pressure. The damping coefficient $\kappa^{(n)}$ is given a value of approximately 
1 in the start-up phase of the multigrid process and is decreased to a value of about 0.4 at 
later cycle numbers to allow for good asymptotic convergence rates. Such a local damping 
with a $\kappa^{(n)}$ that does not vanish is in line with the restriction damping of Koren and 
Hemker (ref. 12), who based their damping coefficients on a more physical analysis. 
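
A minimal sketch of the restriction damping of equations (1) to (3) is given below. It assumes that the pressure switch is the absolute value of the normalized second difference, and that the shifted quantities in equation (2) denote the same switch evaluated at the neighboring points in the respective direction; the function names and the symbol `kappa` for the damping coefficient are illustrative choices.

```python
def nu_i(p, i, j):
    """Normalized second difference of pressure in the i direction."""
    return abs(p[i - 1][j] - 2.0 * p[i][j] + p[i + 1][j]) / (
        p[i - 1][j] + 2.0 * p[i][j] + p[i + 1][j])


def nu_j(p, i, j):
    """Normalized second difference of pressure in the j direction."""
    return abs(p[i][j - 1] - 2.0 * p[i][j] + p[i][j + 1]) / (
        p[i][j - 1] + 2.0 * p[i][j] + p[i][j + 1])


def shock_switch(p, i, j):
    """Equation (2): maximum of the switches at the point and its neighbors."""
    return max(nu_i(p, i, j), nu_i(p, i + 1, j), nu_i(p, i - 1, j),
               nu_j(p, i, j), nu_j(p, i, j + 1), nu_j(p, i, j - 1))


def damping_factor(p, i, j, kappa):
    """Factor multiplying the standard restriction operator in equation (1)."""
    return max(1.0 - kappa * shock_switch(p, i, j), 0.0)


# Pressure jump from 1 to 10 between rows 2 and 3 of a 5 x 5 patch.
p = [[1.0 if i <= 2 else 10.0 for _ in range(5)] for i in range(5)]
print(damping_factor(p, 2, 2, kappa=1.0))   # start-up phase, strong damping
print(damping_factor(p, 2, 2, kappa=0.4))   # later cycles, weaker damping
```

In smooth regions the switch vanishes and the standard restriction is recovered; next to the pressure jump the restricted residual is scaled down.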

A fixed V-type cycle with time stepping only on the way down is used to execute the 
multigrid strategies described above. The robustness of the overall scheme is improved 
by smoothing the resultant coarse-mesh corrections before they are passed to the finest 
mesh. The smoothing reduces the high-frequency oscillations introduced by the linear 
interpolation of the coarse-mesh corrections. The implicit residual smoothing procedure 
with constant coefficients of around 0.1 is used for this smoothing. Also, the application 
of full multigrid (FMG) provides a well-conditioned starting solution for the finest mesh 
that is considered. 


NUMERICAL RESULTS 

Two different hypersonic flow cases are used to assess the capabilities of the multigrid 
schemes. These are laminar Mach 10 (M = 10) flow over a compression ramp and turbulent 
flow over a slender forebody at high Reynolds numbers. Table 1 gives a summary of the 
geometries and the flow parameters of the test cases. In this table, $T_{inf}$ is the dimensional 
free-stream temperature, and $T_w$ is the specified wall temperature. Also, the finest grid 
used for each flow computation is characterized by the streamwise and normal leading-edge 
spacings $\Delta s_{le}$, $\Delta n_{le}$, with the normal spacing $\Delta n_{te}$ at the end of the geometry. 

The flow over the compression ramp is identical to case 3.2 of the Workshop on 
Hypersonic Flows for Reentry Problems, Part II, held in Antibes, France, in 1991. This 
allows comparisons with the performance of other computational methods published in 
ref. 13. Figure 5 displays the coordinate mesh generated for this test case. The low 
Reynolds number allows for a mesh with moderate aspect ratios between 5 and 50 near 
the wall. The 129 x 81 mesh is successively coarsened down to 9 x 6, which yields 9 grid 
levels with semicoarsening and 5 levels with full coarsening. The semicoarsening strategy 
is expected to eliminate most of the stiffness associated with aspect ratio. The converged 
flow solution is shown in Figure 6 for the 129 x 81 and 65 x 41 grids. The computed extent 
of separation in the corner is somewhat smaller for the coarse mesh than for the fine mesh. 
The fine-mesh results agree well with grid-converged computations published in ref. 14. 

In the next figures, we investigate the performance of the different multigrid schemes. 
For this purpose, computations were started from a solution that was converged to about 
plotting accuracy. Results from the different schemes of Figure 4 are compared in Figure 
7. The numbers indicate the final convergence rate r of the schemes and the rate of data 
processing (RDP) on a Cray-YMP to advance one grid point by one multigrid cycle. 
The sequential semicoarsening scheme (Figure 4(c)) gives by far the best convergence rate. 
For this scheme, the effect of the modifications in the multigrid strategies of Figure 4 is 
investigated in Figure 8. The meshes obtained by full coarsening and by semicoarsening in 
the direction normal to the wall are both important in achieving good convergence rates. 
From Figures 7 and 8, we conclude that semicoarsening with a selected number of coarse 
meshes is most effective for this flow problem. Semicoarsening is about 2.4 times faster 
than full coarsening, which does a surprisingly good job because of its low work count. In 
particular, the multigrid scheme with sequential semicoarsening converges (i.e., the residual 
was reduced 3 orders of magnitude) in roughly 33 sec (CPU time) on a Cray-YMP, which 
is 10 times faster than the single-mesh scheme. 

The flow over a slender forebody is chosen to represent a generic configuration that 
corresponds to a high-speed civil transport aircraft or an air-breathing space transportation 
system with low wave drag. The high Reynolds numbers yield thin boundary layers, which 
can only be resolved with highly clustered coordinate meshes and large-aspect-ratio cells. 
The mesh used for the present investigations is displayed in Figure 9. The cells near the 
wall have aspect ratios up to 25000. The flow computations were done with fixed transition 
at 2 percent chord and with the assumption of an adiabatic wall. Figure 10 shows the Mach 
contours for the mesh of 256 x 96 cells, and Figures 11 and 12 show the solution obtained on 
three successively refined meshes. Both the distributions of the skin friction and the wall 
temperature are accurately computed, even with only 25 points in the normal direction. 

Next we examine the convergence behavior of the multigrid schemes. The fine mesh 
with 257 x 97 points allows 11 grid levels to be used with semicoarsening. The full diamond- 
shaped tree of coarse meshes cannot be run because the time-stepping scheme is not well 
suited to handle the extreme aspect ratios that occur on the coarse meshes. With the 
proper half of the diamond, which includes the meshes with relatively low aspect ratios, 
the numerical solution converges. Figure 13 displays a comparison of the different multigrid 
strategies. The computations are started from a preconverged solution. Again, the scheme 
with sequential semicoarsening converges best. The differences between the multigrid 
schemes for this case, which has cells with very high aspect ratios, are larger than for the 
ramp flow. The final convergence rate of the scheme with sequential semicoarsening is 15 
times better than the rate with full coarsening. A comparison of the performance for the 
complete FMG process is given in Figure 14. The sequential semicoarsening scheme takes 
194 cycles and 570 sec to reduce the averaged residuals to $10^{-2}$ on the fine mesh. The 
scheme with full coarsening takes 1024 cycles and 1430 sec, and the single mesh code takes 
7762 time steps and 6190 sec to achieve the same convergence level. Note that residuals of 
$10^{-2}$ correspond to a solution that is converged within plotting accuracy. If we compared 
computer times to reach lower levels of residuals instead, then the results would have been 
even better for the multigrid scheme with semicoarsening. 
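
The FMG timings quoted above can be turned into relative figures with a few lines of arithmetic. The short script below uses only the cycle counts and CPU times stated in the text; the dictionary and variable names are illustrative.

```python
# (multigrid cycles or time steps, CPU seconds) as quoted in the text
runs = {
    "sequential semicoarsening": (194, 570.0),
    "full coarsening": (1024, 1430.0),
    "single mesh": (7762, 6190.0),
}

ref_cycles, ref_time = runs["sequential semicoarsening"]
time_ratio = {name: t / ref_time for name, (_, t) in runs.items()}
cycle_ratio = {name: c / ref_cycles for name, (c, _) in runs.items()}
sec_per_cycle = {name: t / c for name, (c, t) in runs.items()}

for name in runs:
    print(f"{name:26s} time x{time_ratio[name]:6.2f}  "
          f"cycles x{cycle_ratio[name]:6.2f}  "
          f"sec/cycle {sec_per_cycle[name]:5.2f}")
```

In CPU time, sequential semicoarsening is roughly 10.9 times faster than the single-mesh code and about 2.5 times faster than full coarsening. One semicoarsening cycle costs about 2.1 times as much as one full-coarsening cycle, but about 5.3 times fewer cycles are needed.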

CONCLUSIONS 

New multigrid schemes for hypersonic flow computations have been investigated. The 
basic solution algorithm employs upwind discretization and explicit multistage time stepping. 
Various multigrid schemes with semicoarsening are introduced to overcome the 
stiffness that results from the high-aspect-ratio mesh cells used to resolve viscous flows. 
The basic components of the algorithm are examined with a Fourier stability analysis ap- 
plied to the two-dimensional advection equation. Both the results of the Fourier analysis 
and the computations of high Reynolds number flows suggest that the semicoarsening ap- 
proach is effective. The convergence rates shown for hypersonic viscous flows are similar 
or even better than those previously published for the transonic regime in refs. 1 an . 
Further work is required to make the computational scheme less expensive. This need for 
more research is particularly true for the coarse meshes used within the semicoarsening 
approach, which make up the major portion of the overall work count of the scheme. 

(a) Two levels, full coarsening, AR = 10 (CFL = 5.0, CFL* = 2.4). 

(b) Two levels, sequential semicoarsening, weights = 1.0, AR = 10 (CFL = 5.0, CFL'

Figure 2. Contour plots of amplification factor for 5-stage Runge-Kutta 
scheme with first-order upwind approximation and 3 evaluations 
of dissipation (coefficients: 0.2742, 0.2067, 0.5020, 0.5142, 1.0). 




Figure 3. Control volume for nodal-point scheme. 

(a) Full coarsening. 

(b) Semicoarsening with simple averaging. 

(c) Sequential semicoarsening. 

(d) Semicoarsening with selective averaging. 

Figure 4. Multilevel schemes. 

(a) Mach contours. 

Figure 6. Flow solution for ramp-flow problem. 

Figure 7. Influence of multigrid strategies on convergence for ramp-flow problem. 

Figure 8. Influence of selected coarse meshes on convergence for ramp-flow problem. 

Figure 11. Distribution of skin friction along forebody. 

Figure 12. Distribution of adiabatic wall temperature along forebody. 


SUMMARY 


Finding the optimal position for the individual cells (also called functional modules) on the chip 
surface is an important and difficult step in the design of integrated circuits. This paper deals with 
the problem of relative placement, that is the minimization of a quadratic functional with a large, 
sparse, positive definite system matrix. The basic optimization problem must be augmented by 
constraints to inhibit solutions where cells overlap. Besides classical iterative methods, based on 
conjugate gradients (CG), we show that algebraic multigrid methods (AMG) provide an interesting 
alternative. For moderately sized examples with about 10000 cells, AMG is already competitive 
with CG and is expected to be superior for larger problems. Besides the classical “multiplicative” 
AMG algorithm where the levels are visited sequentially, we propose an “additive” variant of AMG 
where levels may be treated in parallel and that is suitable as a preconditioner in the CG algorithm. 

THE PLACEMENT PROBLEM IN INTEGRATED CIRCUIT LAYOUT OPTIMIZATION 


In this paper we present some results of research in algebraic multigrid methods (AMG). Our 
interest in these methods is motivated by an application arising in the layout optimization for 
integrated circuits. Modern integrated circuits consist of several millions of transistors. The layout 
optimization for an integrated circuit is usually based on grouping the transistors into cells (also 
called functional modules) like NAND/NOR-gates. This leads to the problem of finding the optimal 
location (placement) for hundreds of thousands of such cells on the chip surface. The goal of this 
optimization is to find a design that uses as little surface area as possible and that minimizes the 
time delay caused by long connections between cells. Short connections are desirable, because they 
permit higher clock rates and thus faster chips. 

Generally, finding the optimal layout for a given functional description of an integrated circuit is 
a formidable task. From a mathematical point of view the problem begins with the modeling of the 
above informal optimality conditions. Furthermore, cells cannot be positioned freely on the chip 
surface. Clearly, they must not overlap, so that we must consider their individual size and shape. 
Additionally, the manufacturing process introduces constraints on the locations permitted. 

*This research is supported by the SFB 0342 of the Deutsche Forschungsgemeinschaft (DFG). 

Our research is done in the context of GORDIAN, a state-of-the-art layout synthesis package† 
that has been developed at the Institute for Electronic Design Automation, Technische Universität 
München; see Kleinhans, Sigl, and Johannes [1, 2]. Within this package, the placement problem is 
handled by breaking it into two separate steps, the relative placement and the final placement. 
The purpose of the relative placement step is to provide a good initial guess for the final 
placement by finding the global optimum of a sequence of problems with a simplified optimality 
condition. After relative placement, only local effects are considered in the final placement, much 
simplifying the task of positioning a cell within the constraints. 

The global relative placement optimization is based on a force model where connections between 
cells are weighted according to their Euclidean length. Modules are connected by signals that can 
be interpreted as abstract connections of the cells. Implicitly, the positions of the signals are also 
subject to the optimization process. 

The functional in the relative placement optimization is quadratic with a positive definite 
M-matrix C whose entries represent the graph of connections between the cells and signals. 
Mathematically, the problem can be stated as 

$$\min_{x \in \mathbb{R}^n} \; x^T C x - 2 b^T x, \qquad (1)$$

where $x, b \in \mathbb{R}^n$ and $C \in \mathbb{R}^{n \times n}$. 
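
Because C is symmetric positive definite, the gradient of the functional in (1), 2Cx - 2b, vanishes exactly at the solution of Cx = b, so the unconstrained minimization is a linear system solve. The following sketch illustrates this with a plain (unpreconditioned) conjugate gradient iteration on a small dense example; the matrix, the right-hand side, and all function names are assumptions chosen for illustration and are unrelated to actual placement data.

```python
def matvec(C, x):
    return [sum(c * xj for c, xj in zip(row, x)) for row in C]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def functional(C, b, x):
    """Value of x^T C x - 2 b^T x."""
    return dot(x, matvec(C, x)) - 2.0 * dot(b, x)


def conjugate_gradient(C, b, tol=1e-12, max_iter=1000):
    """Solve C x = b for symmetric positive definite C."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - C x for x = 0
    p = list(r)
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr <= tol * tol:
            break
        Cp = matvec(C, p)
        alpha = rr / dot(p, Cp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ci for ri, ci in zip(r, Cp)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


C = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
b = [1.0, 0.0, 1.0]
x = conjugate_gradient(C, b)
print(x, functional(C, b, x))
```

Any perturbation of the computed x increases the functional value, which is a quick way to check the equivalence numerically.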

An unaugmented minimization of (1), however, tends to cluster the cells in the center of the 
chip. This is unrealistic, because there is too much overlap between the cells so that the final 
placement step would not be able to find acceptable positions for the cells. Therefore, the 
optimization is augmented with linear constraints of the form 

Ax = d, (2) 

that specify centers of gravity for groups of cells. These constraints are introduced successively by 
recursively partitioning the cells into groups with equal overall cell surface area, and assigning their 
center of gravity to subdomains of the chip surface. This is illustrated in Figs. 1 and 2, where the 
results after five successive partitioning steps with 1, 2, 4, 8, and 16 constraints are shown; see also 
Regler [3]. 


THE AMG ALGORITHM OF RUGE AND STUBEN 


Our application leads to a large, sparse, positive definite system of equations, which is in no way 
related to a partial differential equation. A typical matrix structure is displayed in Figure 4. The 
placement optimization program GORDIAN presently uses a preconditioned conjugate gradient 
method for the minimization in the relative placement step. We now study the suitability of 
algebraic multigrid as an alternative. Classical, geometric multigrid methods have been very 
successful for solving (1) when the matrix originates from the discretization of elliptic partial 
differential equations. Here, however, we need an algebraic multigrid method that works as a 
black-box solver given only the system matrix C and the right hand side vector b. 

†In fact, GORDIAN compared favorably at the 1992 “TimberWolf Hunt”, an international competition for 
placement algorithms. 

Figure 2: Partition 3, 4, and final placement 

In general, the key to multigrid methods is a family of smaller, coarse level systems 

$$\min_{x^k \in \mathbb{R}^{n_k}} \; (x^k)^T C^k x^k - 2 (b^k)^T x^k, \qquad (3)$$

for $k = 1, 2, \ldots, K$, where the superscript denotes the level, $x^k, b^k \in \mathbb{R}^{n_k}$, 
$C^k \in \mathbb{R}^{n_k \times n_k}$, and the dimensions $n_k$ form a decreasing sequence 

$$n_1 > n_2 > \cdots > n_K \ge 1.$$

The original system coincides with the first and largest problem in the family, $C = C^1$, $b = b^1$. 

For an AMG algorithm, the sequence of matrices $C^k$ must be constructed algebraically. The 
smaller $C^k$ are computed successively by selecting a subset of the unknowns of the level k - 1 
system and by evaluating the strength of the connections between the unknowns in $C^{k-1}$. The basis 
for this paper is the AMG method of Ruge and Stüben [4] that uses the assumptions 

$$\begin{aligned}
& C = (\gamma_{ij})_{1 \le i,j \le n} \ \text{symmetric positive definite}, \\
& \gamma_{ij} \le 0 \quad \text{for } 1 \le i,j \le n,\ i \ne j, \\
& \textstyle\sum_{j=1}^{n} \gamma_{ij} \ge 0 \quad \text{for } 1 \le i \le n.
\end{aligned} \qquad (4)$$

With (4) the effect of Gauss-Seidel iterations on C is well understood and can be used to guide the 
construction of the coarser level systems $C^k$ for k = 2, 3, ..., K. 
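
The algebraic part of the assumptions (4) is easy to verify for a given matrix. The sketch below checks symmetry, nonpositive off-diagonal entries, and nonnegative row sums for a dense matrix stored as a list of rows; positive definiteness is not tested here, and the function name and tolerance are illustrative assumptions.

```python
def satisfies_sign_assumptions(C, tol=1e-12):
    """Check symmetry, off-diagonal signs, and row sums from (4).

    Positive definiteness is assumed, not verified.
    """
    n = len(C)
    symmetric = all(abs(C[i][j] - C[j][i]) <= tol
                    for i in range(n) for j in range(i + 1, n))
    offdiag_nonpositive = all(C[i][j] <= tol
                              for i in range(n) for j in range(n) if i != j)
    rowsums_nonnegative = all(sum(C[i]) >= -tol for i in range(n))
    return symmetric and offdiag_nonpositive and rowsums_nonnegative


laplacian = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
print(satisfies_sign_assumptions(laplacian))
print(satisfies_sign_assumptions([[1.0, 2.0], [2.0, 1.0]]))
```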


AMG methods were first introduced in the early eighties by Brandt, McCormick, and Ruge 
[5, 6, 7]. AMG is necessarily less efficient than highly specialized geometric multigrid solvers for 
elliptic problems on uniform rectangular grids. However, for more complicated cases with complex 
domains, AMG has been shown to behave quite favorably in terms of operation count and CPU 
time. AMG also works for problems where geometric multigrid methods are impossible to design. 
In this paper we will show that AMG works very satisfactorily even for the matrices in chip design. 

The generality of AMG must be paid for by a setup phase that may take 80% or more of the 
overall time. This setup is needed to construct the sequence of reduced matrices $C^k$ together with 
appropriate transfer operators from level k to level k + 1, 

$$I_k^{k+1} : \mathbb{R}^{n_k} \to \mathbb{R}^{n_{k+1}}. \qquad (5)$$

This step is quite expensive and contains code that does not vectorize or parallelize well. 

We will briefly review the AMG algorithm, as introduced by Ruge and Stüben [4, 8]. The most 
interesting part may be the setup routine to build the family of systems (3) with the transfer 
operators (5). 


The matrices C k are constructed such that each of the unknowns x k on level k, (k > 1), will 
represent an unknown on the next finer level k — 1. The level k — 1 unknown represented by x k on 
level k is denoted by x k ^J , the corresponding finer level unknown. This naturally partitions the 
unknowns on each level (except the coarsest) into those that correspond to a coarser level unknown, 
and those that do not. These will be called the C- and F-unknowns of a level, respectively. The 
partitioning is performed in two phases on each level. At the beginning of the first phase, the 
unknowns with strictly diagonally dominant matrix rows are determined. These unknowns are not 
restricted to a coarser level. 

SETUP PHASE I: 



Next, in a second phase the final C-point choice is made. 
SETUP PHASE II: 



In these algorithms we use 

$$d(i, S) := \frac{\sum_{j \in S} (-\gamma_{ij})}{\max_{k \ne i} \{-\gamma_{ik}\}},$$

and $S_i := \{ j \in N_i \mid d(i, \{j\}) \ge \alpha \}$, $S_i^T := \{ j \mid i \in S_j \}$, where $N_i := \{ j \mid j \ne i,\ \gamma_{ij} \ne 0 \}$ is the set of 
neighbors of i. 
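
The sets of strong connections can be computed directly from the matrix entries. The sketch below assumes the usual Ruge and Stüben form of the measure, in which d(i, S) is the sum of the negated entries of row i over S divided by the largest negated off-diagonal entry of that row; the function names and the example matrix are illustrative.

```python
def neighbors(C, i):
    """N_i: indices j != i with a nonzero entry in row i."""
    return {j for j in range(len(C)) if j != i and C[i][j] != 0.0}


def d(C, i, S):
    """Connection strength of unknown i to the set S (assumed form)."""
    largest = max(-C[i][k] for k in range(len(C)) if k != i)
    return sum(-C[i][j] for j in S) / largest


def strong_set(C, i, alpha):
    """S_i: neighbors on which unknown i depends strongly."""
    return {j for j in neighbors(C, i) if d(C, i, {j}) >= alpha}


def strong_transpose_set(C, i, alpha):
    """S_i^T: unknowns that depend strongly on unknown i."""
    return {j for j in range(len(C))
            if j != i and i in strong_set(C, j, alpha)}


C = [[4.0, -1.0, -3.0],
     [-1.0, 2.0, -1.0],
     [-3.0, -1.0, 4.0]]
for i in range(3):
    print(i, strong_set(C, i, 0.5), strong_transpose_set(C, i, 0.5))
```

With the threshold 0.5, the weak coupling of size 1 in rows 0 and 2 is ignored next to the coupling of size 3, while row 1 treats both of its equal couplings as strong.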

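After the unknowns of the level have been partitioned, the interpolation operator 
$I_{k+1}^k : \mathbb{R}^{n_{k+1}} \to \mathbb{R}^{n_k}$ is defined by $v^k = I_{k+1}^k v^{k+1}$, 

$$v_i^k = \begin{cases}
v_i^{k+1} & \text{for } i \in C^k, \\
-\sum_{j \in C^k} \gamma_{ij} v_j^{k+1} / \gamma_{ii} & \text{for } i \in F^k \setminus F_0^k, \\
0 & \text{for } i \in F_0^k.
\end{cases} \qquad (6)$$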

The coarse level system for level k + 1 is now defined by the so-called Galerkin or variational 
conditions. The restriction operator is the transpose of the interpolation operator, 

$$I_k^{k+1} = (I_{k+1}^k)^T, \qquad (7)$$

and the reduced system matrix is 

$$C^{k+1} = I_k^{k+1} C^k I_{k+1}^k. \qquad (8)$$

Note that all coarse level matrices inherit the positive definiteness from C, provided all $I_{k+1}^k$ have 
full rank. 
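
As a small worked instance of the Galerkin conditions (7) and (8), the sketch below forms the coarse matrix as the triple product of the transposed interpolation, the fine matrix, and the interpolation. The interpolation used here is simple linear interpolation for a one-dimensional model matrix, chosen only for illustration; it is not the interpolation produced by the AMG setup phases.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def galerkin(C, P):
    """Coarse matrix P^T C P for an interpolation matrix P."""
    return matmul(transpose(P), matmul(C, P))


# Fine level: 5 unknowns, tridiag(-1, 2, -1); coarse level: 2 unknowns.
C = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
      for j in range(5)] for i in range(5)]
P = [[0.5, 0.0],
     [1.0, 0.0],
     [0.5, 0.5],
     [0.0, 1.0],
     [0.0, 0.5]]
print(galerkin(C, P))
```

The result is symmetric with positive diagonal and nonpositive off-diagonal entries, so the structure assumed in (4) is preserved in this example.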

The AMG algorithm can now be described as follows. 

1. set k = 0 

2. Do set k = k + 1; SETUP PHASE I and SETUP PHASE II; until $|\Omega^k| = 1$ 

3. While $\|b - Cx\| > \delta$ 

      MGSTEP(1) 


MGSTEP(k): 

1. If k = K then solve (11) 

2. else SMOOTH($x^k$) 

      set $b^{k+1} = I_k^{k+1} (b^k - C^k x^k)$ 

      MGSTEP(k+1) 

      set $x^k = x^k + I_{k+1}^k x^{k+1}$ 

      SMOOTH($x^k$) 

3. endif 
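
The recursive structure of MGSTEP can be sketched in a few dozen lines. The example below is a minimal illustration under several assumptions: levels are numbered from 0 instead of 1, the coarse correction starts from zero, the hierarchy is built with linear interpolation for a one-dimensional model matrix rather than by the AMG setup phases, no constraints are present, and the coarsest system is solved by Gaussian elimination. All names are hypothetical.

```python
def matvec(M, v):
    return [sum(a * x for a, x in zip(row, v)) for row in M]


def gauss_seidel(C, x, b):
    """One lexicographic Gauss-Seidel sweep, updating x in place."""
    n = len(b)
    for i in range(n):
        s = sum(C[i][j] * x[j] for j in range(n) if j != i)
        x[i] = (b[i] - s) / C[i][i]


def solve_direct(C, b):
    """Gaussian elimination without pivoting (adequate for small SPD systems)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(C, b)]
    for c in range(n):
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def mgstep(k, Cs, Ps, x, b):
    """One V-cycle starting on level k; the last level is solved directly."""
    if k == len(Cs) - 1:
        return solve_direct(Cs[k], b)
    P, nc = Ps[k], len(Ps[k][0])
    gauss_seidel(Cs[k], x, b)                                # pre-smoothing
    r = [bi - ci for bi, ci in zip(b, matvec(Cs[k], x))]     # residual
    bc = [sum(P[i][J] * r[i] for i in range(len(r))) for J in range(nc)]
    xc = mgstep(k + 1, Cs, Ps, [0.0] * nc, bc)               # coarse problem
    for i in range(len(x)):                                  # correction
        x[i] += sum(P[i][J] * xc[J] for J in range(nc))
    gauss_seidel(Cs[k], x, b)                                # post-smoothing
    return x


def linear_interpolation(nc):
    """Interpolation from nc coarse to 2*nc + 1 fine unknowns (model choice)."""
    P = [[0.0] * nc for _ in range(2 * nc + 1)]
    for J in range(nc):
        P[2 * J][J], P[2 * J + 1][J], P[2 * J + 2][J] = 0.5, 1.0, 0.5
    return P


def galerkin(C, P):
    n, nc = len(C), len(P[0])
    CP = [[sum(C[i][l] * P[l][J] for l in range(n)) for J in range(nc)]
          for i in range(n)]
    return [[sum(P[i][I] * CP[i][J] for i in range(n)) for J in range(nc)]
            for I in range(nc)]


def build_hierarchy(n):
    """Model hierarchy for tridiag(-1, 2, -1) with n = 2**m - 1 unknowns."""
    C = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
          for j in range(n)] for i in range(n)]
    Cs, Ps = [C], []
    while len(Cs[-1]) > 1:
        Ps.append(linear_interpolation((len(Cs[-1]) - 1) // 2))
        Cs.append(galerkin(Cs[-1], Ps[-1]))
    return Cs, Ps


Cs, Ps = build_hierarchy(7)
b = matvec(Cs[0], [1.0] * 7)
x = [0.0] * 7
for _ in range(40):
    x = mgstep(0, Cs, Ps, x, b)
print(max(abs(xi - 1.0) for xi in x))
```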
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VARIANTS OF AMG 


We now discuss the handling of constraints in the AMG algorithm. Just like the system matrix, 
the constraints (2) must be transferred to the coarse levels. Equation (2) thus becomes a family of 
constraints 

$$A^k x^k = d^k, \qquad (9)$$

for k = 1, 2, ..., K corresponding to the reduced systems (3), where 

$$A^{k+1} = A^k I_{k+1}^k. \qquad (10)$$

The original constraint matrix coincides with the first and largest problem in the family, $A = A^1$, $d = d^1$. 

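The algorithm is modified such that (9) is satisfied on each level. On the coarsest level this is 
accomplished by solving the system with constraints directly using a Lagrange multiplier approach 

$$\begin{pmatrix} C^K & (A^K)^T \\ A^K & 0 \end{pmatrix}
\begin{pmatrix} x^K \\ \lambda \end{pmatrix} =
\begin{pmatrix} b^K \\ d^K \end{pmatrix}. \qquad (11)$$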

The definition of the coarse level equations and constraints by a Galerkin condition has the effect 
that the finer level equations remain satisfied after a coarse grid correction, provided the coarse 
level constraints have been satisfied. 
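
The constrained coarsest-level problem can be illustrated with a tiny example. The sketch below assembles the block system with the system matrix, the constraint matrix, and its transpose, and solves it by Gaussian elimination with partial pivoting, which is needed because of the zero diagonal block. The data and the function names are assumptions for illustration only.

```python
def solve_pivot(M, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(M, rhs)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (aug[r][n] - sum(aug[r][k] * x[k] for k in range(r + 1, n))) / aug[r][r]
    return x


def solve_constrained(C, A, b, d):
    """Minimize x^T C x - 2 b^T x subject to A x = d via Lagrange multipliers."""
    n, m = len(C), len(A)
    K = [C[i][:] + [A[q][i] for q in range(m)] for i in range(n)]
    K += [A[q][:] + [0.0] * m for q in range(m)]
    sol = solve_pivot(K, list(b) + list(d))
    return sol[:n], sol[n:]          # unknowns, multipliers


# Two cells of equal area whose center of gravity is fixed at 1.
C = [[2.0, -1.0], [-1.0, 2.0]]
A = [[0.5, 0.5]]
x, lam = solve_constrained(C, A, b=[0.0, 0.0], d=[1.0])
print(x, lam)
```

Without the constraint the minimizer would be the origin; the constraint moves both unknowns to 1.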

After each smoothing step, the constraints will be violated. This is compensated by an 
additional projection that enforces the constraints. Note that for general constraints the transfer 
can lead to coarse grid problems that are not well defined. This has been studied in detail in 
Bungartz [9]. Even if both $A^k$ and $I_{k+1}^k$ have full rank, $A^k I_{k+1}^k$ may not. In this case constraints 
have become linearly dependent and the subspace determined by $A^{k+1} x^{k+1} = d^{k+1}$ is either 
overdetermined or empty. In the case of overdetermined constraints, the number of constraints 
should be reduced. Numerically, however, detecting and treating this situation is difficult. Ideally, 
the matrix of constraints A k should already be considered in the coarse level setup. 

Here, we concentrate on the type of situation arising in the placement problem. With each 
constraint, a group of cells is assigned to a subdomain. Each cell is uniquely assigned to one such 
subdomain and the coefficients of the matrix A k are determined by the relative surface area of the 
corresponding cells. Clearly, the rows of A k are orthogonal. The coarse level constraints will remain 
consistent if the interpolation $I_{k+1}^k$ is constructed such that a coarse level variable only interpolates 
variables belonging to the same subdomain. Unfortunately, the constraints are still unknown in the 
(first) setup phase. In practice inconsistencies rarely arise, if we guarantee that the dimension of 
the coarsest level is larger than the number of constraints. 
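
The structure of the constraint matrix described above can be made concrete with a short sketch. Each row holds the relative surface areas of the cells in one group, so the product with a vector of cell coordinates gives the center of gravity of that group. The function name and the sample data are assumptions for illustration.

```python
def gravity_constraints(areas, groups):
    """One row per group; entries are cell areas relative to the group total."""
    rows = []
    for group in groups:
        total = sum(areas[c] for c in group)
        row = [0.0] * len(areas)
        for c in group:
            row[c] = areas[c] / total
        rows.append(row)
    return rows


areas = [1.0, 3.0, 2.0, 2.0]
groups = [[0, 1], [2, 3]]            # each cell belongs to exactly one group
A = gravity_constraints(areas, groups)

positions = [0.0, 4.0, 10.0, 20.0]
centers = [sum(a * p for a, p in zip(row, positions)) for row in A]
row_product = sum(a * b for a, b in zip(A[0], A[1]))
print(A, centers, row_product)
```

Because no cell appears in two groups, different rows have disjoint nonzero patterns, which is why they are orthogonal.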

On the coarsest level the Lagrange multipliers $\lambda$ must be calculated. This requires the solution 
of a full system of a dimension that is equal to the number of constraints. The number of 
constraints doubles with each partitioning step. Thus the coarsest permissible level may be quite 
large and expensive to solve exactly, making the algorithm unacceptable for large chips. 

Figure 3: The double arrow shows in which direction the solution is calculated; here it is started in the 
y direction. The box indicates the subdomains for which the computation is performed. The dashed 
line indicates a separation of the regions; cells cannot cross such a line during the overall placement 
calculation. 

Experience shows that the influence of cells in different subdomains is rather small and may be 
neglected. Additionally, earlier experiments with GORDIAN have shown that the quadratic 
objective functional is only a crude approximation to the true one. It can be argued that the usual 
routing of connections in the final layout induces a measure of distances that is modeled better by 
an $l_1$-like norm and a linear objective functional (see Sigl [10]). This motivates an algorithm that 
recursively splits the problem into independent ones by partitioning into subdomains. A solution 
subdomain is defined as two neighboring subdomains that have been obtained by partitioning a single 
subdomain of the previous iterations. We can now simulate the effect of a linear objective 
functional by keeping the cells fixed in all subdomains except those in the current solution 
subdomain. This must be repeated for all solution subdomains. Thus, though the above 
simplification changes the mathematical model, the modified algorithm may help to produce better 
overall layouts. This is indicated by experimental results. 

The algorithm is illustrated in Figure 3. The first two calculations are performed as before, 
without any change. After the second partitioning, the computation of the overall chip is split into 
an upper and a lower solution subdomain. When the new solution for the upper solution 
subdomain is computed, the cells of the lower solution subdomain are kept fixed. Simultaneously, 
the lower solution subdomain is computed with fixed upper domain cell positions. This is repeated 
recursively until the partitioning is completed. Clearly, this algorithm can be easily parallelized 
because each solution subdomain can be computed independently. Because no data exchange 
between the different solution subdomains is necessary, this is a plain divide-and-conquer algorithm 
inducing a natural parallelization. Note that we have to solve systems with at most two 
simultaneous constraints. This leads to an algorithm where it is sufficient to perform the setup 
once at the beginning of the computation. Before each optimization step the (at most two) 
constraints are tested for linear dependencies. In the case of inconsistent constraints the previous 
level is taken for the coarsest level. 


The conventional setup of the coarse level matrices is variational in the sense that (7) and (8) 
are satisfied. Experience shows that the coarse level matrices tend to fill up rather quickly. On the 
other hand, the definition by equation (8) often leads to small matrix entries, so that one may have 
the idea to modify the coarser matrices by dropping small entries. More precisely, we may perturb 


each $C^k$ to

$$\tilde{C}^k = C^k + E^k, \qquad (12)$$



such that the matrix remains sparse. This will not only speed up each individual iteration, but also 
simplify coarser matrix setups. We suggest performing the perturbation such that the matrix 
remains symmetric and such that dropped values are added to the diagonal with the opposite sign. 
For an analysis of these perturbations see Muszynski, Rude, and Zenger [11], Bungartz [9], and 
Chang and Wong [12]. 


Classical AMG is used with a single sweep of Gauss-Seidel smoothing on each level. 

Alternatively, we may use Jacobi-type smoothers. As usual, the Jacobi method must be damped to 
obtain good smoothing. Though the Jacobi method is usually a less efficient smoother than 
Gauss-Seidel (even with optimal damping), it may be an interesting alternative, because it has a 
symmetric error propagation matrix without performing sweeps in reverse order. Jacobi-AMG may
thus be used directly as a preconditioner for the conjugate gradient method. Another advantage of 
Jacobi is parallelization. To parallelize Gauss-Seidel we would have to find a coloring scheme for a 
general unstructured matrix that permits the parallel execution of relaxation steps. Our 
experimental results (see Figure 5) indicate that two optimally damped Jacobi iterations are about 
as good a smoother as a single sweep of Gauss-Seidel. This is in agreement with experience for the 
solution of partial differential equations. In future work we intend to experiment with other 
smoothers, like conjugate residuals or incomplete LU decomposition; see e.g. Bank and Douglas 
[13]. 

We denote the diagonal part of $C^k$ by $D^k$ and can thus write a damped Jacobi iteration for level
$k$ as

$$x^k \leftarrow x^k + \omega (D^k)^{-1} (b^k - C^k x^k), \qquad (13)$$

where $\omega$ is the relaxation parameter. For the error $e = x - C^{-1}b$ in the original system, a relaxation
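on level $k$ has an effect that can be described by

$$e \leftarrow \left(I - \omega\, I^k (D^k)^{-1} (I^k)^T C\right) e, \qquad (14)$$

where

$$I^k = \prod_{i=1}^{k-1} I_{i+1}^{i}. \qquad (15)$$

The AMG algorithm in its simplest form (with a single sweep of Jacobi on each level) has an error
propagation

$$e \leftarrow \prod_{k=1}^{K} \left(I - \omega\, I^k (D^k)^{-1} (I^k)^T C\right) e. \qquad (16)$$
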


This is a typical multiplicative method. 
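
The damped Jacobi sweep of equation (13) can be sketched on a single level as follows. This is only an illustration under stated assumptions: the test matrix (a small tridiagonal model matrix), the damping value 2/3, and the function name `damped_jacobi_sweep` are not taken from the paper.

```python
import numpy as np

def damped_jacobi_sweep(C, b, x, omega):
    """One damped Jacobi sweep: x <- x + omega * D^{-1} (b - C x)."""
    d = np.diag(C)                      # diagonal part D of C
    return x + omega * (b - C @ x) / d

# Hypothetical model problem: symmetric positive definite tridiagonal matrix.
n = 16
C = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = np.zeros(n)

r0 = np.linalg.norm(b - C @ x)          # residual norm before smoothing
for _ in range(2):                      # two damped Jacobi sweeps
    x = damped_jacobi_sweep(C, b, x, omega=2.0 / 3.0)
r2 = np.linalg.norm(b - C @ x)          # residual norm after smoothing
```

Because the sweep needs only the residual and the diagonal, all components can be updated at once, which is the parallelization advantage mentioned above.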




All conventional multigrid methods, including AMG, are multiplicative algorithms in the sense 
that the levels are visited sequentially in a predetermined order. The recent development of 
multilevel methods has led to the formulation of a class of additive multilevel methods. These 
include the AFAC type algorithms (see McCormick [14]), the BPX method (see Bramble, Pasciak 
and Xu [15]), and the multilevel additive Schwarz methods (see Dryja and Widlund [16]). Formally, 
these methods do not form a product of operators as in (16), but a sum, whose terms can — in 
principle — be computed simultaneously. 

With some exceptions (like the AFAC method), additive methods provide only preconditioners 
that are divergent when used as iterations by themselves. However, they define operators with 
improved condition numbers, and so they will lead to fast convergence when suitably damped or 
when they are used in combination with self-scaling iterative methods, most notably the conjugate 
gradient algorithm. Recent results have shown that these methods can have typical multigrid 
efficiency with convergence rates independent of the problem size. 

We will show that for our problems

$$P^k = \sum_{j=1}^{k} I^j (D^j)^{-1} (I^j)^T C \qquad (17)$$

also has a better condition number than the original matrix $C$. Note that an application of $P^k$ does
not require the explicit construction of the corresponding matrix, but only the restriction of the
residuals to all levels, just like in conventional AMG. An iteration based on $P^k$, like

$$x = x + \omega \sum_{j=1}^{k} I^j (D^j)^{-1} (I^j)^T (b - Cx) \qquad (18)$$

will only converge when suitably damped with $\omega < 1$. Preferably, (18) is used as a preconditioner
for a conjugate gradient iteration.


NUMERICAL EXPERIMENTS 

Our first example is a typical benchmark chip called Primary I with 752 cells, 81 fixed cells, and
902 signals. Figure 4 shows the corresponding matrix structure, and Figure 5 displays the
convergence history of (multiplicative) AMG using different smoothers for the solution of (1).
Clearly, two sweeps of damped Jacobi are almost as good a smoother as Gauss-Seidel (GS). In
Table 1 the minimal and the maximal eigenvalue ($\lambda_{\min}$, $\lambda_{\max}$) plus the condition number
$\kappa = \lambda_{\max}/\lambda_{\min}$ of $P^k$ are shown. On the coarsest level ($k = 6$) $D^k$ is replaced by $C^k$. This means
that the coarsest level equations are solved exactly. In Figure 6 we present the corresponding
spectrum of the eigenvalues for $k = 1, 3, 6$. In each case, the first few eigenvalues are marked by
asterisks (*) and diamonds (o), respectively. Columns 5 and 6 of Table 1 additionally display the
density and dimension of the coarse level system $C^k$.

In Figure 7 we show the convergence history for preconditioned CG in analogy to Figure 5 in 
comparison to the AMG-solver with Gauss-Seidel smoothing. Conventional AMG is superior to 
AMG-preconditioned CG, partly because Gauss-Seidel is a better smoother than Jacobi. However, 



Figure 4: Sparsity pattern of system matrix of Primary I 



Figure 5: Convergence history for AMG with different smoothers 


  k    $\lambda_{\min}$    $\lambda_{\max}$    $\kappa$     density    dim
  1    0.0085    1.8301    215.3    0.02       752
  2    0.0302    3.4020    133.4    0.09       343
  3    0.0598    4.2936     71.8    0.33       147
  4    0.1117    4.9196     44.6    0.72        59
  5    0.2500    5.9495     23.8    0.95        26
  6    0.4060    6.1167     15.1    1.00        10


Table 1: Eigenvalues and characteristics of Primary I 







Figure 6: Eigenvalues of P k for Primary I matrix 



Figure 7: Convergence history of preconditioned CG for Primary I 

AMG-preconditioned CG is an interesting alternative when we consider its potential for 
parallelization. 

In further tests we have applied the AMG algorithm to a problem of similar size arising from the 
discretization of a partial differential equation and have found that the behavior is surprisingly 
similar. 

Finally, we present results for a real-life chip with 25178 cells. The original preconditioned CG
solver (CG) in GORDIAN is replaced by the AMG routines combined with the divide-and-conquer
strategy. In Table 2 we compare the CPU times for CG and AMG for the optimization after each
partitioning step. The first AMG step includes the setup time, which is 9 times as expensive as the
iteration itself, but still faster than CG. AMG outperforms conventional CG for almost all
subproblems, except the very last six partitions. In the overall time AMG is still clearly superior to CG.


Following the divide and conquer strategy, we transform the system into separately solvable 
subproblems after the second partition. This requires a transformation of the data that is not yet 
optimally implemented. The column labeled “sol” therefore shows the time for the AMG solution 
process without the overhead for this data transformation. The overhead for the transformation 
increases with the number of partitions, adding to the cost of the AMG method. 

However, each subdomain can be computed in parallel. To illustrate the potential for 
parallelization, the “par” column shows the maximal time needed for computation of a subdomain, 
thus simulating the effect of an optimal parallelization. The example chip for this calculation is a 
standard cell chip. This type of chip has a fixed number of rows of cells. Thus subdomains with 
height below a certain minimum are not permitted. To avoid this, GORDIAN computes the 
partition for both directions until the maximal number of rows is reached. Here, this applies to
partitions 9, 10, and 11 during conventional CG; for the AMG method this happens during partitions
11, 12, and 13. As the partitioning progresses, the original AMG setup may no longer be suitable and
must be repeated for the subdomains that cause trouble. In our example this has been the case in
partitions 5, 8, and 9. For further discussion see Regler [3].

CONCLUDING REMARKS 


We have discussed the application of algebraic multigrid methods and have proposed several
variants and extensions of the classical AMG method of Ruge and Stüben, including constrained
optimization and a new additive algorithm. We have shown that the AMG method is a highly
competitive alternative for the layout optimization of real-life chips.

Acknowledgements: We wish to thank H. Bungartz, K. Doll, F. M. Johannes, G. Sigl, and C. 
Zenger for many helpful discussions. 
Summary


In this paper we describe a novel generalized SOR algorithm for accelerating the 
convergence of the dynamic iteration method known as waveform relaxation. A new 
convolution SOR algorithm is presented, along with a theorem for determining the 
optimal convolution SOR parameter. Both analytic and experimental results are given 
to demonstrate that the convergence of the convolution SOR algorithm is substantially 
faster than that of the more obvious frequency-independent waveform SOR algorithm. 
Finally, to demonstrate the general applicability of this new method, it is used to solve 
the differential-algebraic system generated by spatial discretization of the time-dependent 
semiconductor device equations. 


Introduction 


To achieve highest performance on a parallel computer, a numerical algorithm must 
avoid frequent parallel synchronization [1]. The waveform relaxation approach to solving
time-dependent initial-value problems is just such a method, as the iterates are waveforms 
over an interval, rather than single timepoints [2, 3, 4]. Like any relaxation scheme, 
efficiency depends on rapid convergence, and there have been several investigations into how 
to accelerate WR [2, 5], including using multigrid [6] and conjugate direction techniques [7]. 

In this paper, we investigate using successive overrelaxation (SOR) to accelerate WR 
convergence. In particular, we show that the pessimistic results about waveform SOR 
derived in [2] can be substantially improved by replacing multiplication with a fixed SOR 
parameter by convolution with an SOR kernel. We derive the optimal SOR kernel using 

* This work was supervised by Professors Jacob White and Jonathan Allen and supported
by a grant from IBM, the Defense Advanced Research Projects Agency contract N00014-91-J-1698,
and the National Science Foundation contract MTP-8858764 A02.
Fourier analysis techniques and then demonstrate the effectiveness of the approach for a
model parabolic problem. Finally, we demonstrate the general applicability of the approach 
by using the method to solve the time-dependent drift-diffusion equations associated with 
modeling semiconductor devices. 

We begin in Section 2 by reviewing waveform SOR, and in Section 3 we relate the 
algorithm to pointwise SOR to demonstrate the difficulty in accelerating WR with a fixed 
SOR parameter. In Section 4, we use Fourier analysis to derive the SOR kernel for the 
continuous WR algorithm, and give a proof of optimality. In Section 5 we briefly consider 
the effect of time-discretization, and in Section 6 we apply the method to device simulation. 
Finally, conclusions and acknowledgements are given in Section 7. 


Waveform SOR 


In this section, we consider applying waveform relaxation methods to the model linear 
initial-value problem 

(1) $\left(\frac{d}{dt} + A\right) x(t) = b(t)$ with $x(0) = x_0$,

where $A \in \mathbb{R}^{n \times n}$, $b(t) \in \mathbb{R}^n$ is a given time-dependent right-hand side vector, $x(t) \in \mathbb{R}^n$ is
the unknown vector to be computed over the simulation interval $t \in [0,T]$, and $x_0 \in \mathbb{R}^n$ is an
initial condition.

Given the relaxation splitting $A = D - L - U$, and subtracting successive waveform
relaxation iterations, the waveform Gauss-Jacobi (WGJ) and waveform Gauss-Seidel
(WGS) iteration equations, respectively, may be written as:

(2) $\left(\frac{d}{dt} + D\right) \Delta x^{k+1}(t) = (L + U)\, \Delta x^k(t)$

(3) $\left(\frac{d}{dt} + D - L\right) \Delta x^{k+1}(t) = U\, \Delta x^k(t)$,

where $\Delta x^{k+1}(t) = x^{k+1}(t) - x^k(t)$ is used to eliminate the right-hand side $b(t)$.

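The waveform SOR method for acceleration of WGS is a simple extension of algebraic
SOR. To derive the waveform SOR iteration equation, compute a waveform $\hat{x}_i^{k+1}(t)$ on
$t \in [0, T]$, as in WGS:

(4) $\left(\frac{d}{dt} + a_{ii}\right) \hat{x}_i^{k+1}(t) = b_i(t) - \sum_{j=1}^{i-1} a_{ij} x_j^{k+1}(t) - \sum_{j=i+1}^{n} a_{ij} x_j^k(t)$ with $\hat{x}_i^{k+1}(0) = x_{0,i}$,

and then update $x_i^k(t)$ in the iteration direction by multiplication with an overrelaxation
parameter $\omega$,

(5) $x_i^{k+1}(t) \leftarrow x_i^k(t) + \omega \cdot \left[\hat{x}_i^{k+1}(t) - x_i^k(t)\right]$.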
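
The two steps (4) and (5) can be sketched in discrete time as follows. This is a minimal illustration, assuming backward Euler for each scalar waveform solve and a small hypothetical 4 x 4 test problem; the function names, the test data, and the choice $\omega = 1$ (which reduces the method to waveform Gauss-Seidel) are assumptions, not the author's implementation.

```python
import numpy as np

def waveform_sor(A, b, x0, h, num_iters, omega=1.0):
    """Backward-Euler waveform SOR sketch for (d/dt + A) x = b, x(0) = x0.

    b has shape (num_steps + 1, n): column i samples the waveform b_i(t).
    """
    num_steps, n = b.shape[0] - 1, b.shape[1]
    x0 = np.asarray(x0, dtype=float)
    x = np.tile(x0, (num_steps + 1, 1))          # initial guess: constant waveforms
    for _ in range(num_iters):
        for i in range(n):
            # Columns j < i already hold iterate k+1, columns j > i hold iterate k.
            coupling = x @ A[i] - A[i, i] * x[:, i]
            rhs = b[:, i] - coupling
            x_hat = np.empty(num_steps + 1)
            x_hat[0] = x0[i]
            for m in range(1, num_steps + 1):    # scalar backward Euler, step (4)
                x_hat[m] = (x_hat[m - 1] + h * rhs[m]) / (1.0 + h * A[i, i])
            x[:, i] += omega * (x_hat - x[:, i]) # overrelaxation, step (5)
    return x

def backward_euler(A, b, x0, h):
    """Reference solution: backward Euler applied to the full coupled system."""
    num_steps, n = b.shape[0] - 1, b.shape[1]
    x = np.empty((num_steps + 1, n))
    x[0] = x0
    lhs = np.eye(n) + h * A
    for m in range(1, num_steps + 1):
        x[m] = np.linalg.solve(lhs, x[m - 1] + h * b[m])
    return x

# Hypothetical small test problem.
n, h, num_steps = 4, 0.5, 20
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
t = h * np.arange(num_steps + 1)
b = np.zeros((num_steps + 1, n))
b[:, 0] = 1.0 - np.cos(2.0 * np.pi * t / 5.0)
x0 = np.zeros(n)

x_wr = waveform_sor(A, b, x0, h, num_iters=100, omega=1.0)
x_ref = backward_euler(A, b, x0, h)
max_diff = np.max(np.abs(x_wr - x_ref))
```

Each outer iteration sweeps over whole waveforms rather than single timepoints, so the fixed point of the sweep coincides with the timestep-by-timestep solution of the coupled system.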





Combining equations (4) and (5) yields

(6) $\left(\frac{d}{dt} + a_{ii}\right) x_i^{k+1}(t) = (1 - \omega) \left[\left(\frac{d}{dt} + a_{ii}\right) x_i^k(t)\right] + \omega \left[ b_i(t) - \sum_{j=1}^{i-1} a_{ij} x_j^{k+1}(t) - \sum_{j=i+1}^{n} a_{ij} x_j^k(t) \right]$,

which, after subtracting successive waveform relaxation iterations, leads to

(7) $\left(\frac{d}{dt} + D - \omega L\right) \Delta x^{k+1}(t) = \left[(1 - \omega)\left(\frac{d}{dt} + D\right) + \omega U\right] \Delta x^k(t)$,

where $\Delta x^{k+1}(t) = x^{k+1}(t) - x^k(t)$.

Note that the iteration matrices implied by equations (2), (3) and (7) correspond
exactly to the standard algebraic relaxation and SOR matrices with diagonal matrix $D$
replaced by $\left(\frac{d}{dt} + D\right)$. Also note that waveform SOR as defined by (7) is not the same as
the dynamic SOR iteration considered in [2], because, unlike WGJ or WGS, the waveform
SOR iteration equations are not of the form

(8) $\frac{d}{dt} \Delta x^{k+1} + M \Delta x^{k+1} = N \Delta x^k$

where $M, N \in \mathbb{R}^{n \times n}$.


Relation to Pointwise SOR 


Discretizing (1) in time using a multistep integration method yields

(9) $\sum_{j=0}^{s} \alpha_j x[m-j] = h \sum_{j=0}^{s} \beta_j \left( b[m-j] - A x[m-j] \right)$,

where $\alpha_0 = 1$ and $x[m]$ denotes $x(t)$ at timepoint $t = mh$ with timestep $h$. Thus, the
time-discretized model problem can be rewritten as a sequence of linear algebraic problems

(10) $\left[I + h\beta_0 A\right] x[m] = h\beta_0 b[m] - \sum_{j=1}^{s} \alpha_j x[m-j] + h \sum_{j=1}^{s} \beta_j \left( b[m-j] - A x[m-j] \right)$.

We now compare the convergence of the waveform SOR method to the convergence of 
pointwise SOR, in which algebraic SOR is used to solve the matrix problem at each 
timepoint. 

The pointwise SOR iteration equations are derived by applying the relaxation splitting
$A = D - L - U$ to equation (10) and taking the difference between the $(k+1)$st and $k$th
iterations. More precisely, the pointwise SOR iteration equation applied to solve (10) for
$\Delta x^{k+1}[m] = x^{k+1}[m] - x^k[m]$ is

(11) $\left[(I + h\beta_0 D) - \omega h\beta_0 L\right] \Delta x^{k+1}[m] = \left[(1 - \omega)(I + h\beta_0 D) + \omega h\beta_0 U\right] \Delta x^k[m]$,

where $\omega$ is the SOR parameter. It follows that the spectral radius of the iteration matrix
generated by pointwise SOR at the $m$th timestep is

(12) $\rho\left( \left[(I + h\beta_0 D) - \omega h\beta_0 L\right]^{-1} \left[(1 - \omega)(I + h\beta_0 D) + \omega h\beta_0 U\right] \right)$.

If waveform SOR is used to solve the model problem (1), and a multistep method is
used to solve iteration equation (7), then $\Delta x^{k+1}[m]$, now denoting the discretized difference
in waveform iterates, satisfies

(13) $\sum_{j=0}^{s} \alpha_j \left[ \Delta x^{k+1}[m-j] - (1 - \omega) \Delta x^k[m-j] \right] = h \sum_{j=0}^{s} \beta_j \left\{ -(D - \omega L) \Delta x^{k+1}[m-j] + \left[(1 - \omega) D + \omega U\right] \Delta x^k[m-j] \right\}$.

This can be rewritten as the discrete-time analogue of (7):

(14) $\sum_{j=0}^{s} \left[ (\alpha_j I + h\beta_j D) - \omega h\beta_j L \right] \Delta x^{k+1}[m-j] = \sum_{j=0}^{s} \left[ (1 - \omega)(\alpha_j I + h\beta_j D) + \omega h\beta_j U \right] \Delta x^k[m-j]$.

As the similarities of equations (11) and (14) suggest, if the time interval is finite,
i.e. the number of timesteps is some finite $L$, then for a given timestep $h$ and a given
SOR parameter $\omega$, the time-discretized waveform SOR method has the same asymptotic
convergence rate as pointwise SOR.

Theorem 3.1. On a finite simulation interval, the iterations defined by (11) and (14)
have the same asymptotic convergence rate.

Proof. Let $y^k$ denote the large vector consisting of the concatenation of vectors $\Delta x^k[m]$
at all $L$ discrete timepoints, i.e. $y^k = \left[\Delta x^k[1]^T, \ldots, \Delta x^k[L]^T\right]^T$. Collecting together the
equations (14) generated at each timepoint into one large matrix equation in terms of
vectors $y^{k+1}$ and $y^k$ yields $M y^{k+1} = N y^k$, where $M, N \in \mathbb{R}^{Ln \times Ln}$ are block lower
triangular banded matrices, with blocks of size $n \times n$, and with block bandwidth $s$. It is
then easily seen that $M^{-1}N$ is block lower triangular, with diagonal blocks equal to

(15) $\left[(I + h\beta_0 D) - \omega h\beta_0 L\right]^{-1} \left[(1 - \omega)(I + h\beta_0 D) + \omega h\beta_0 U\right]$.

Therefore, $\rho(M^{-1}N)$ is given by (12), implying that the iterations defined by (11) and (14)
have identical asymptotic convergence rates. □


Theorem 3.1 suggests that parameter $\omega$ for waveform SOR should be chosen to be
precisely equal to the optimum parameter for the pointwise SOR method. However, this
does not necessarily lead to fast convergence, as the following example illustrates.

Example 3.1. Let $t \in [0, 2048]$, $x(0) = 0$, and let matrix $A \in \mathbb{R}^{32 \times 32}$ and time-dependent
input vector $b(t) \in \mathbb{R}^{32}$ of the model problem (1) be given by

(16) $A = \begin{bmatrix} 2 & -1 & & \\ -1 & 2 & \ddots & \\ & \ddots & \ddots & -1 \\ & & -1 & 2 \end{bmatrix}, \qquad b(t) = \begin{bmatrix} b_1(t) \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \qquad$ where $b_1(t) = \begin{cases} 1 - \cos\left(\frac{2\pi t}{256}\right) & \text{if } t < 256 \\ 0 & \text{otherwise.} \end{cases}$

Consider the four problems generated by discretizing in time with the first-order backward
difference formula, using 64, 128, 256, and 512 uniform timesteps of size $h = 32$, 16, 8 and
4 respectively.


Since the tridiagonal matrix $A$ is symmetric and is consistently ordered [8, 9], the
matrix $(I + h\beta_0 A)$ of the pointwise time-discretized model problem (10) is also consistently
ordered, and the optimum pointwise SOR parameter $\omega_{opt}$ is given by

(17) $\omega_{opt} = \frac{2}{1 + \sqrt{1 - \mu_1^2}}$

where $\mu_1 = \rho(H_{GJ})$ is the spectral radius of the pointwise Gauss-Jacobi iteration matrix
$H_{GJ} = (I + h\beta_0 D)^{-1}(h\beta_0 L + h\beta_0 U)$. For the four problems with 64, 128, 256 and
512 timesteps, the optimum pointwise parameters $\omega_{opt}$ are 1.669, 1.586, 1.482 and 1.364
respectively.
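
The four quoted parameters can be reproduced from equation (17) with a short numerical check. The sketch below assumes $\beta_0 = 1$ (first-order backward difference formula) and computes the spectral radius numerically; the function name `pointwise_omega_opt` is illustrative only.

```python
import numpy as np

def pointwise_omega_opt(n, h, beta0=1.0):
    """Optimum pointwise SOR parameter (17) for A = tridiag(-1, 2, -1)."""
    A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    D = np.diag(np.diag(A))
    L_plus_U = D - A                                  # from the splitting A = D - L - U
    # Pointwise Gauss-Jacobi matrix H_GJ = (I + h*beta0*D)^{-1} (h*beta0*(L + U))
    H_gj = np.linalg.solve(np.eye(n) + h * beta0 * D, h * beta0 * L_plus_U)
    mu1 = np.max(np.abs(np.linalg.eigvals(H_gj)))     # spectral radius
    return 2.0 / (1.0 + np.sqrt(1.0 - mu1**2))

# Timestep sizes for 64, 128, 256 and 512 steps on [0, 2048].
omegas = [pointwise_omega_opt(32, h) for h in (32.0, 16.0, 8.0, 4.0)]
```

Under these assumptions the computed values agree with the parameters listed in the text to three decimal places.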

Curves PT64, PT128, PT256 and PT512 of Figure 1 show the convergence of the
waveform SOR method versus iteration for the four problems with their optimum pointwise
SOR parameters $\omega_{opt}$. Note that as the total number of timesteps is increased, the initial
convergence rate is slower, approaching a limiting value of the convergence rate of the
continuous Gauss-Seidel WR algorithm (shown as WR in Figure 1). In each case, the
convergence rate of the waveform SOR eventually approaches the expected asymptotic
value of $\omega_{opt} - 1$. Note that with a reasonable error tolerance such as $10^{-6}$ as a
stopping point, the asymptotic convergence rate is never reached. For comparison, Figure 1
also shows the superposition of four convergence plots (CSOR) of the new convolution SOR
method to be introduced in the following sections.



Fig. 1. Convergence of waveform SOR using
the pointwise optimal parameter (PT) compared to
waveform relaxation (WR), and convolution SOR
(CSOR), with 64, 128, 256 and 512 timesteps.



Fig. 2. Effect on convergence of the 256-timestep
waveform SOR of varying the SOR parameter from
the pointwise optimum $\omega_{opt} \approx 1.482$.


To illustrate the effect of choosing a different SOR parameter $\omega$, Figure 2 shows the
convergence versus iteration of the 256-timestep example for waveform SOR with values
of the SOR parameter $\omega$ not equal to the pointwise optimum $\omega_{opt} = 1.482$. When
$\omega = 1.30 < \omega_{opt}$, the convergence curve lies between the pointwise optimum curve and
the WR convergence curve, i.e. both initial and asymptotic convergence rates are slower.
By increasing the SOR parameter to $\omega = 1.63 > \omega_{opt}$, the initial convergence rate can
be made faster at the expense of slowing down the asymptotic convergence rate. But as
the $\omega = 1.70$ curve shows, once the SOR parameter is increased beyond some point, the
waveform SOR method may appear to diverge before eventually converging. Also, the
solution produced by the $\omega = 1.70$ example contains spurious oscillations, as shown in
Figure 3. Note both the growth and translation of the oscillation with iteration.

The optimum pointwise SOR parameter $\omega_{opt}$ does not dramatically improve the
convergence rate of waveform SOR because the matrix $M^{-1}N$ which describes the
waveform SOR convergence is far from normal. This suggests that although the spectral
radius of the iteration matrix determines the asymptotic convergence rate of waveform
SOR, it does not determine the practically observable convergence rate. The convergence
rate could be characterized, for example, by computing the pseudo-eigenvalues [10] of the
waveform SOR iteration matrix. In the following section, we take an alternate approach.

Fig. 3. Delta waveform $\Delta x_{16}^{k+1}(t) = x_{16}^{k+1}(t) - x_{16}^{k}(t)$
versus time after iterations 250 and 500,
for the 256-timestep waveform SOR method using
$\omega = 1.70$, showing the growth and translation of an
oscillating region.


Fig. 4. The spectral radii as functions of frequency
$\Omega$ of the Gauss-Jacobi WR (solid), Gauss-Seidel
WR (dashed) and waveform SOR (dotted) iteration
matrices for an 8 x 8 version of the continuous-time
problem of Example 3.1.


Fourier Analysis 


In [2], the spectral radius of dynamic iteration operators which map $x^k$ to $x^{k+1}$, such
as those given by equations (2), (3), and (8), was related to their Fourier transform. In this
section, we make a more detailed use of Fourier analysis to derive a frequency-dependent
SOR parameter for the waveform SOR operator of equation (7).

The Fourier transform of $x^k(t)$ is given by

(18) $x^k(i\Omega) = \int_{-\infty}^{\infty} x^k(t) e^{-i\Omega t}\, dt = \mathcal{F}\{x^k(t)\}$,

where $\Omega$ is frequency. Standard Fourier identities can be used to show that $\Delta x^{k+1}(i\Omega) =
H(i\Omega) \Delta x^k(i\Omega)$, where for WGJ (2), WGS (3) and waveform SOR (7), the iteration
operator $H(i\Omega)$ is given by

(19) $H_{GJ}(i\Omega) = (i\Omega I + D)^{-1}(L + U)$

(20) $H_{GS}(i\Omega) = (i\Omega I + D - L)^{-1} U$

(21) $H_{SOR}(i\Omega) = (i\Omega I + D - \omega L)^{-1}\left[(1 - \omega)(i\Omega I + D) + \omega U\right]$

respectively. The obvious interpretation of equations (19)-(21) is that the spectral radius
$\rho(H(i\Omega))$ yields the asymptotic convergence rate for errors in the frequency component $\Omega$.

Figure 4 is a plot of the spectral radii of $H_{GJ}(i\Omega)$, $H_{GS}(i\Omega)$ and $H_{SOR}(i\Omega)$ for an
8 x 8 version of the continuous-time problem given in Example 3.1, using $\omega = 1.49$ for
$H_{SOR}(i\Omega)$. From the plot it is clear that very high frequency components of the error are
damped much more quickly than low frequency components. However, the spectral radius
$\rho(H_{SOR}(i\Omega))$ is greater than one over a range of frequencies, and therefore the waveform
SOR iteration magnifies errors in this frequency range. This effect was predicted in [2] and
is easily seen in Figure 3.
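
The behavior described for Figure 4 can be checked numerically by evaluating the operators (20) and (21) on a frequency grid. The sketch below is an illustration only: the frequency grid and the helper names are assumptions, and the 8 x 8 matrix is the tridiagonal matrix of Example 3.1 reduced in size.

```python
import numpy as np

def splitting(n):
    """Return D, L, U with A = D - L - U for A = tridiag(-1, 2, -1)."""
    A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    D = np.diag(np.diag(A))
    L = -np.tril(A, k=-1)
    U = -np.triu(A, k=1)
    return D, L, U

def spectral_radius(H):
    return np.max(np.abs(np.linalg.eigvals(H)))

def rho_wgs(Omega, D, L, U):
    """Spectral radius of H_GS(i*Omega), equation (20)."""
    s = 1j * Omega * np.eye(D.shape[0]) + D
    return spectral_radius(np.linalg.solve(s - L, U))

def rho_wsor(Omega, omega, D, L, U):
    """Spectral radius of H_SOR(i*Omega), equation (21), for a fixed omega."""
    s = 1j * Omega * np.eye(D.shape[0]) + D
    return spectral_radius(np.linalg.solve(s - omega * L, (1.0 - omega) * s + omega * U))

D, L, U = splitting(8)
freqs = np.logspace(-2, 2, 81)            # hypothetical frequency grid
rho_gs = np.array([rho_wgs(W, D, L, U) for W in freqs])
rho_sor = np.array([rho_wsor(W, 1.49, D, L, U) for W in freqs])
rho_sor_dc = rho_wsor(0.0, 1.49, D, L, U) # zero-frequency (algebraic SOR) value
```

On this grid the fixed-parameter SOR radius is below one at zero frequency but exceeds one at intermediate frequencies, while the Gauss-Seidel radius stays below one throughout.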

This situation can be remedied by using a generalized SOR algorithm, in which
equation (5) is replaced by an overrelaxation convolution with a time-dependent SOR
parameter $\omega(t)$,

(22) $x_i^{k+1}(t) \leftarrow x_i^k(t) + \int_0^t \omega(t - \tau) \cdot \left[\hat{x}_i^{k+1}(\tau) - x_i^k(\tau)\right] d\tau$.

The Fourier transform of the SOR operator is then given by

(23) $H_C(i\Omega) = \left(i\Omega I + D - \omega(i\Omega) L\right)^{-1} \left[\left(1 - \omega(i\Omega)\right)\left(i\Omega I + D\right) + \omega(i\Omega) U\right]$,

where $\omega(i\Omega)$ is the Fourier transform of the time-dependent $\omega(t)$. We refer to the SOR
algorithm represented by iteration matrix (23) as the convolution SOR algorithm (CSOR).
The theorem below, which is the main result of this paper, gives a formula for determining
the optimal frequency-dependent SOR parameter $\omega(i\Omega)$.

Theorem 4.2. If the spectrum of $H_{GJ}(i\Omega)$ lies on the line segment $[-\mu_1(i\Omega), \mu_1(i\Omega)]$
with $|\mu_1| < 1$, then the spectral radius of $H_C(i\Omega)$ is minimized at frequency $\Omega$ by a unique
optimum $\omega(i\Omega) = \omega_{opt}(i\Omega) \in \mathbb{C}$ given by

(24) $\omega_{opt}(i\Omega) = \frac{2}{1 + \sqrt{1 - \mu_1(i\Omega)^2}}$,

where $\sqrt{\cdot}$ denotes the root with the positive real part.

Proof. For brevity, the argument $(i\Omega)$ will be omitted in the following, and $H_C(\omega)$ will
denote the convolution SOR operator (at frequency $\Omega$) computed using SOR parameter $\omega$.

Let $\mu_i = r_i \mu_1$ denote each eigenvalue of $H_{GJ}$, where $r_i \in [-1, 1]$. Classical SOR
theory [8, 9] guarantees that for each $\mu_i = r_i \mu_1$, there is an eigenvalue $\lambda_i$ of $H_C(\omega)$ which
satisfies

(25) $\lambda_i - \omega r_i \mu_1 \sqrt{\lambda_i} + (\omega - 1) = 0$,

and therefore, from the quadratic formula,

(26) $\sqrt{\lambda_i} = \frac{\omega r_i \mu_1}{2} + \sqrt{\frac{\omega^2 r_i^2 \mu_1^2}{4} - (\omega - 1)}$.

Let $\omega$ be the conjectured optimal $\omega_{opt}$. Combining equation (24) with (26) yields

(27) $\left|\sqrt{\lambda_i}\right| = \left|\sqrt{\omega_{opt} - 1}\right| \left| r_i + \sqrt{r_i^2 - 1} \right| = \left|\sqrt{\omega_{opt} - 1}\right|$,

where the rightmost equality follows from the fact that $|r_i + \sqrt{r_i^2 - 1}| = 1$ for $r_i \in [-1, 1]$.

And as (27) holds for all $i$,

(28) $\rho(H_C(\omega_{opt})) = \max_i |\lambda_i(\omega_{opt})| = |\omega_{opt} - 1|$.

Equation (28) implies that $\rho(H_C(\omega))$ cannot be decreased below $\rho(H_C(\omega_{opt}))$ by using
an $\omega$ such that $|\omega - 1| > |\omega_{opt} - 1|$. This follows from the fact that, in general [8, 9],

(29) $\rho(H_C(\omega)) \ge |\omega - 1|$

for any $\omega$.

To show that $\rho(H_C(\omega))$ also cannot be decreased by choosing a value of $\omega$ such that
$|\omega - 1| < |\omega_{opt} - 1|$, consider the eigenvalue $\lambda_1$ corresponding to $\mu_1$,

(30) $\sqrt{\lambda_1} = f_+(\omega) = \frac{\omega \mu_1}{2} + \sqrt{\frac{\mu_1^2 \omega^2}{4} - \omega + 1}$,

and note that $f_+ : \mathbb{C} \to \mathbb{C}$, given by equation (30), is a single-valued, continuous function
that is analytic except at

(31) $\omega_1, \omega_2 = \frac{2}{1 \pm \sqrt{1 - \mu_1^2}}$.

Since $|\mu_1| < 1$, points $\omega_1$ and $\omega_2$ lie in the interior and exterior, respectively, of the
circle $|\omega - 1| = 1$ in the complex $\omega$-plane. Note that $\omega_1$ equals the conjectured $\omega_{opt}$ from
equation (24).

Let $D$ denote the interior of the curve given by the perimeter of the circle $|\omega - 1| = 1$,
except with a cut along the line defined by the circle's center and $\omega_1$. The cut follows the
line from the perimeter down to $\omega_1$, and then back up the other side to the perimeter, as
shown in Figure 5. The function $f_+$ is nonzero everywhere within $D$, since equation (25)
implies that a zero can occur only at $\omega = 1$, and $f_+(1) = \mu_1$. Therefore, the minimum
modulus theorem [11] implies that $|f_+(\omega)|$ attains its minimum value somewhere on the
boundary of $D$. Finally, the lower bound in (29) implies that $\omega = \omega_{opt}$ in (24) is the only
point on $D$ which can achieve as low a $\rho(H_C(\omega))$ as given in (28), completing the proof. □


Note that when the eigenvalues $\mu$ lie on a real line segment, this is yet another
alternative proof of a classic SOR Theorem [8, 9, 12]. Also note that, in general, the
optimal overrelaxation parameter $\omega(i\Omega)$ is complex.
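
Equations (24) and (28) can be checked numerically at a single frequency. The sketch below assumes the 8 x 8 tridiagonal matrix used for Figure 4 and the arbitrary test frequency $\Omega = 1$; the variable and function names are illustrative only.

```python
import numpy as np

n, Omega = 8, 1.0                          # hypothetical size and test frequency
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
D = np.diag(np.diag(A))
L = -np.tril(A, k=-1)                      # splitting A = D - L - U
U = -np.triu(A, k=1)
s = 1j * Omega * np.eye(n) + D             # i*Omega*I + D

def spectral_radius(H):
    return np.max(np.abs(np.linalg.eigvals(H)))

def sor_operator(omega):
    """Operator (21)/(23) at this frequency for a given (possibly complex) omega."""
    return np.linalg.solve(s - omega * L, (1.0 - omega) * s + omega * U)

# Extreme eigenvalue mu_1(i*Omega) of H_GJ(i*Omega); only its square enters (24).
eig_gj = np.linalg.eigvals(np.linalg.solve(s, L + U))
mu1 = eig_gj[np.argmax(np.abs(eig_gj))]

# np.sqrt on a complex argument returns the root with nonnegative real part.
omega_opt = 2.0 / (1.0 + np.sqrt(1.0 - mu1**2))

rho_csor = spectral_radius(sor_operator(omega_opt))   # frequency-dependent optimum
rho_fixed = spectral_radius(sor_operator(1.49))       # fixed real parameter
```

At this frequency the complex optimum gives a spectral radius matching $|\omega_{opt} - 1|$, as in (28), whereas the fixed real parameter 1.49 gives a radius above one.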





Fig. 5. The region $D$ and branch cuts in the complex $\omega$-plane.

The conditions of the optimal SOR parameter Theorem 4.2 are satisfied by a large class of
matrices.


Corollary 4.3. If $A$ in (1) is consistently ordered, symmetric, and has constant diagonal
$D = dI$, then the optimal SOR parameter is given by (24),

(32) $\omega_{opt}(i\Omega) = \frac{2}{1 + \sqrt{1 - \left(\frac{d \mu_1}{d + i\Omega}\right)^2}}$,

where $\mu_1$ denotes the spectral radius of $D^{-1}(L + U)$.

Proof. To show that Theorem 4.2 applies, note that (19) implies that for a constant
diagonal $A$, the $H_{GJ}(i\Omega)$ eigenvalues $\mu(i\Omega)$ are given by $\mu(i\Omega) = d\mu/(d + i\Omega)$, where $\mu$
are the eigenvalues of $D^{-1}(L + U)$. Since the $\mu$ lie on the real axis, the $\mu(i\Omega)$ lie on a line
rotated in the complex plane. □


Corollary 4.4. If $A$ in (1) is consistently ordered, symmetric, and has constant diagonal
$D = dI$, then the optimal time-dependent SOR convolution waveform $\omega(t)$ is real.

Proof. Equation (32) implies that $\omega_{opt}(i\Omega)$ is a conjugate-symmetric function of $\Omega$. □


Discrete-time Modification 


For the sake of brevity, we consider only the first-order backward difference formula,
in which case equation (14) becomes

(33) $\Delta x^{k+1}[m] + h(D - \omega L) \Delta x^{k+1}[m] - \Delta x^{k+1}[m-1] = (1 - \omega) \Delta x^k[m] + \left[(1 - \omega) hD + h\omega U\right] \Delta x^k[m] - (1 - \omega) \Delta x^k[m-1]$,

where $h$ is the uniform timestep. The $z$-transform of $x[m]$, defined by

(34) $x(z) = \sum_{m=-\infty}^{\infty} x[m] z^{-m} = \mathcal{Z}\{x[m]\}$,

may be used to show that $\Delta x^{k+1}(z) = H_C(z) \Delta x^k(z)$, where the $z$-dependent convolution
SOR operator is

$H_C(z) = \left[\frac{1 - z^{-1}}{h} I + D - \omega(z) L\right]^{-1} \left[\left(1 - \omega(z)\right)\left(\frac{1 - z^{-1}}{h} I + D\right) + \omega(z) U\right]$.

Since $\omega(z)$ depends on $z$, overrelaxation becomes a convolution sum

(35) $x_i^{k+1}[m] \leftarrow x_i^k[m] + \sum_{j=-\infty}^{m} \omega[m-j] \left(\hat{x}_i^{k+1}[j] - x_i^k[j]\right)$,

where $\omega(z) = \mathcal{Z}\{\omega[m]\}$. To determine the optimal $\omega(z)$, we have the following theorem,
whose proof is analogous to that of Theorem 4.2.


Theorem 5.5. If the spectrum of $H_{GJ}(z)$ lies on the line segment $[-\mu_1(z), \mu_1(z)]$ with
$|\mu_1| < 1$, then the spectral radius of $H_C(z)$ is minimized at $z$ by the unique optimum
$\omega(z) = \omega_{opt}(z) \in \mathbb{C}$ given by

(36) $\omega_{opt}(z) = \frac{2}{1 + \sqrt{1 - \mu_1(z)^2}}$,

where $\sqrt{\cdot}$ denotes the root with the positive real part.


In Example 3.1, matrix $A$ has constant diagonal $D = dI$, so that

(37) $\omega_{opt}(z) = \frac{2}{1 + \sqrt{1 - \left(\frac{d \mu_1}{d + \frac{1 - z^{-1}}{h}}\right)^2}}$,

where $\mu_1$ denotes the spectral radius of $D^{-1}(L + U)$. Thus, to compute the optimal
convolution SOR sequence $\omega[m]$ for the four CSOR plots of Figure 1, equation (37) was
used to compute $\omega(z)$, and then the inverse $z$-transform of $\omega(z)$ was computed analytically
by series expansion.
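
As a numerical alternative to the analytic series expansion, the sequence $\omega[m]$ can be approximated by sampling (37) on the unit circle and applying a discrete Fourier transform. This is a sketch under assumptions, not the procedure used in the paper: it treats $\omega_{opt}$ as a power series in $z^{-1}$, uses the closed form $\cos(\pi/(n+1))$ for the spectral radius of $D^{-1}(L+U)$ of the tridiagonal matrix in Example 3.1, and truncates to `num_terms` coefficients.

```python
import numpy as np

def csor_sequence(n=32, d=2.0, h=8.0, num_terms=1024):
    """Approximate omega[m], m = 0..num_terms-1, from equation (37)."""
    mu1 = np.cos(np.pi / (n + 1))      # spectral radius of D^{-1}(L+U), tridiag(-1, 2, -1)
    # Sample w = z^{-1} on the unit circle; omega_opt is expanded in powers of w.
    w = np.exp(2j * np.pi * np.arange(num_terms) / num_terms)
    mu_z = d * mu1 / (d + (1.0 - w) / h)
    omega_z = 2.0 / (1.0 + np.sqrt(1.0 - mu_z**2))   # principal root: positive real part
    # DFT of the samples divided by the sample count gives the series coefficients.
    return np.fft.fft(omega_z) / num_terms

coeffs = csor_sequence()               # h = 8 corresponds to the 256-timestep problem
omega_seq = coeffs.real                # imaginary parts are rounding noise
dc_value = 2.0 / (1.0 + np.sqrt(1.0 - np.cos(np.pi / 33.0)**2))
```

The negligible imaginary parts are consistent with Corollary 4.4, and the sum of the coefficients equals the value of (37) at $z = 1$. Under these assumptions the leading coefficient $\omega[0]$ agrees with the pointwise optimum 1.482 quoted for the 256-timestep problem, since $z \to \infty$ in (37) recovers the pointwise Gauss-Jacobi spectral radius.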
Device Transient Simulation


A device is assumed to be governed by the Poisson equation, and the electron and hole
continuity equations:

(38) $\nabla^2 u + c_1 (p - n + N) = 0$

(39) $\nabla^2 n - \nabla n \cdot \nabla u - n \nabla^2 u = c_2 \frac{\partial n}{\partial t}$

(40) $\nabla^2 p + \nabla p \cdot \nabla u + p \nabla^2 u = c_3 \frac{\partial p}{\partial t}$

where $u$ is the normalized electrostatic potential, $n$ and $p$ are the electron and hole
concentrations, $N$ is a background concentration, and $c_1, c_2, c_3$ are physical constants [13].

Given a rectangular mesh that covers a two-dimensional slice of a MOSFET, a common 
approach to spatially discretizing the device equations is to use a finite-difference formula 
to discretize the Poisson equation, and an exponentially-fit finite-difference formula to 
discretize the continuity equations [13]. On an $N$-node rectangular mesh, the spatial
discretization yields a differential-algebraic system of $3N$ equations in $3N$ unknowns.

The convolution SOR method was implemented in the WR-based device transient 
simulation program WORDS [14]. WORDS uses red/black block Gauss-Seidel WR, where 
the blocks correspond to vertical mesh lines. The equations governing nodes in the same 
block are solved simultaneously using the first order backward-difference formula. The 
implicit algebraic systems generated by the backward difference formula are solved with 
Newton’s method, and the linear equation systems generated by Newton’s method are 
solved with sparse Gaussian elimination. 

The three MOS devices of Figure 6 were used to construct six simulation examples, 
each device being subjected to either a drain voltage pulse with the gate held high (the D 
examples) , or a gate voltage pulse with the drain held high (the G examples) . All examples 
ranged from low to high drain current, and in the G examples, the gate displacement current
was substantial because the applied voltage pulses changed at a rate of 0.2 to 2 volts per
picosecond.


device   description            mesh      unknowns
kar      abrupt junction        19 x 31   1379
ldd      lightly-doped drain    15 x 20   656
soi      silicon-on-insulator   18 x 24   856

[Illustration of the karD example; surviving labels: 5 V, 2.2 microns, 0 psec, 512 psec]



Fig. 6. Description of devices and illustration of the drain-driven karD example. 

Figure 7 shows the convergence of the six examples as a function of iteration for WR,
ordinary waveform SOR (using the pointwise optimum parameter), and the convolution
SOR algorithm. The convolution SOR sequence $\omega[m]$ was calculated by linearizing (38)-(40)
about the initial condition, estimating the spectral radius of the iteration matrix as a
function of z, applying Theorem 5.5 and inverse transforming. Both overrelaxation methods
were applied only to the potential variable u. All simulations began with 64 initial WR 
iterations, and used 256 equally-spaced timesteps. In Figure 7, convergence was measured 
using the terminal current error. 

Despite the nonlinearity of the semiconductor equations, the convolution SOR algo- 
rithm converged substantially faster than either WR or ordinary waveform SOR, demon- 
strating the robustness of the approach. 







Fig. 7. Terminal current error of the six examples as a function of iteration for WR (dashed), ordinary 
waveform SOR (dotted), and convolution SOR (solid). 








Conclusion 


In this paper, a new waveform overrelaxation algorithm was presented and applied 
to solving the differential-algebraic system generated by spatial discretization of the time- 
dependent semiconductor device equations. In the experiments included, the convolution 
SOR algorithm converged robustly, and substantially faster than ordinary WR. 

The author would like to acknowledge extensive conversations with his advisor, 
Professor Jacob White, and also thank Professors Alar Toomre, Donald Rose, Paul 
Lanzcron, Andrew Lumsdaine and Olavi Nevanlinna for many valuable suggestions. 

SUMMARY


A flux-difference splitting type algorithm is formulated for the steady Euler equations on un- 
structured grids. The polynomial flux-difference splitting technique is used. A vertex-centered finite 
volume method is employed on a triangular mesh. 

The multigrid method is in defect-correction form. A relaxation procedure is used with a first-order
accurate inner iteration and a second-order correction performed only on the finest grid. A
multi-stage Jacobi relaxation method is employed as a smoother. A Jacobi-type method is chosen
because the grid is unstructured. The multi-staging is necessary to provide sufficient smoothing properties.

The domain is discretized using a Delaunay triangular mesh generator. Three grids with more or 
less uniform distribution of nodes but with different resolution are generated by successive refinement 
of the coarsest grid. Nodes of coarser grids appear in the finer grids. The multigrid method is 
started on these grids. As soon as the residual drops below a threshold value, an adaptive refinement 
is started. The solution on the adaptively refined grid is accelerated by a multigrid procedure. 
The coarser multigrid grids are generated by successive coarsening through point removal. The
adaption cycle is repeated a few times. 

Results are given for the transonic flow over a NACA-0012 airfoil. 


THE SPACE DISCRETIZATION 


Flux-Difference Splitting 


Figure 1 shows part of an unstructured triangular grid. In the vertex-centered finite volume 
method nodes are located at the vertices of the grid. Every node has a control volume, constructed 
by connecting the centers of gravity of the cells surrounding the node. To close the control volumes 
on the boundary, the midpoints of the boundary edges are chosen as vertices of the control volumes. 

Figure 1: Vertex-centered discretization.

To define the flux through a side of a control volume, use is made of the flux-difference splitting
principle. The Euler equations in two dimensions take the form

$$\frac{\partial U}{\partial t} + \frac{\partial f}{\partial x} + \frac{\partial g}{\partial y} = 0, \qquad (1)$$

where U is the vector of conserved variables, f and g the Cartesian flux vectors, given by

$$f = \begin{pmatrix} \rho u \\ \rho u^2 + p \\ \rho u v \\ \rho u H \end{pmatrix}, \qquad
g = \begin{pmatrix} \rho v \\ \rho u v \\ \rho v^2 + p \\ \rho v H \end{pmatrix}, \qquad (2)$$

where $\rho$ is the density; u and v are the Cartesian velocity components in the x and y directions,
respectively; p is the pressure; $E = p/((\gamma - 1)\rho) + u^2/2 + v^2/2$ is the total energy;
$H = \gamma p/((\gamma - 1)\rho) + u^2/2 + v^2/2$ is the total enthalpy; and $\gamma$ is the adiabatic constant.


For the side ab of the node i and node j control volumes (figure 1), the flux-difference can be 
written as 

$$\Delta F_{i,j} = (n_x \Delta f_{i,j} + n_y \Delta g_{i,j})\,\Delta s_{i,j}, \qquad (3)$$

where $\Delta f_{i,j}$ and $\Delta g_{i,j}$ denote the differences of the Cartesian flux vectors, $n_x$ and $n_y$ are the
components of the unit normal to the side ab in the sense i to j, and $\Delta s_{i,j}$ is the length of the side. The
differences of the Cartesian flux vectors are

$$\Delta f_{i,j} = f_j - f_i, \qquad \Delta g_{i,j} = g_j - g_i, \qquad (4)$$

where $f_i$, $g_i$ and $f_j$, $g_j$ are the flux vectors calculated with the flow variables in node i and node j,
respectively.

The flux-difference defined by equation (3) can be written as 

$$\Delta F_{i,j} = A_{i,j}\,(U_j - U_i)\,\Delta s_{i,j} = A_{i,j}\,\Delta U_{i,j}\,\Delta s_{i,j}. \qquad (5)$$



To define the discrete Jacobian A, the polynomial flux-difference splitting is used. This splitting 
technique was introduced by the second author [1]. Full details are given in [2, 3]. The polynomial
flux-difference splitting is a Roe-type technique, i.e. it satisfies the primary requirements formulated
by Roe [4]. However, it is simpler. Its simplicity follows from dropping the secondary requirement
of having a unique definition of averaged flow variables. This secondary requirement defines the 
original Roe-splitting within the class of methods allowed by Roe’s primary requirements. However, 
the secondary requirement is not necessary. The precise splitting used is not really relevant for the 
method we describe here. Therefore, we do not detail the method any further. The only important 
result is equation (5). 

The discrete Jacobian $A_{i,j}$ in equation (5) has real eigenvalues that are discrete analogs of the
normal velocity through the side of the control volume, and the normal velocity plus and minus the
velocity of sound. The Jacobian also has a complete set of eigenvectors. These properties are a direct
consequence of the hyperbolic character of the Euler equations with respect to time. The Jacobian
matrix A can be written as

$$A = R\,\Lambda\,L, \qquad (6)$$

where $\Lambda$ denotes the eigenvalue matrix and where R and L denote the right and left eigenvector
matrices in orthonormal form. The matrix A can be split into positive and negative parts by

$$A^+ = R\,\Lambda^+ L, \qquad A^- = R\,\Lambda^- L. \qquad (7)$$
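
As a small numerical illustration of equations (6) and (7), the sketch below forms the positive and negative parts of a matrix from given eigenvector matrices. It is not the polynomial flux-difference splitting itself: the helper names (`matmul`, `diag`, `split_matrix`) and the 2x2 example are assumptions chosen for illustration, and the positive (negative) eigenvalue matrix is taken, as is usual, to keep only the positive (negative) eigenvalues.

```python
def matmul(X, Y):
    """Product of two small dense matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def diag(values):
    """Diagonal matrix with the given entries."""
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

def split_matrix(R, eigenvalues, L):
    """Return (A_plus, A_minus) with A_plus + A_minus = R diag(eigenvalues) L."""
    lam_plus = [max(lam, 0.0) for lam in eigenvalues]    # positive eigenvalues only
    lam_minus = [min(lam, 0.0) for lam in eigenvalues]   # negative eigenvalues only
    A_plus = matmul(matmul(R, diag(lam_plus)), L)
    A_minus = matmul(matmul(R, diag(lam_minus)), L)
    return A_plus, A_minus

# Hypothetical example: A = [[0, 4], [1, 0]] has eigenvalues +2 and -2.
R = [[2.0, -2.0], [1.0, 1.0]]      # columns are right eigenvectors
L = [[0.25, 0.5], [-0.25, 0.5]]    # rows are left eigenvectors, L = inverse of R
A_plus, A_minus = split_matrix(R, [2.0, -2.0], L)
```

For this example the two parts sum back to the original matrix, which is the property used when equation (8) is rewritten as equation (9).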


Upwind Flux Definition 


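For the side ab of the node i and node j control volumes (figure 1), the first order upwind flux is
defined by

$$F_{i,j} = \tfrac{1}{2}(F_i + F_j) - \tfrac{1}{2}\left(\Delta F^+_{i,j} - \Delta F^-_{i,j}\right), \qquad (8)$$

where

$$F_i = (n_x f_i + n_y g_i)\,\Delta s_{i,j}, \qquad F_j = (n_x f_j + n_y g_j)\,\Delta s_{i,j}.$$

Using equation (5), equation (8) can be written as

$$F_{i,j} = F_i + A^-_{i,j}\,\Delta U_{i,j}\,\Delta s_{i,j}. \qquad (9)$$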

This way of writing the flux shows the incoming wave components. 
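
The step from equation (8) to equation (9) can be written out as follows. This is a sketch that assumes the standard form of the first order upwind flux, in which the positive and negative parts of the flux-difference are $A^{\pm}_{i,j}\,\Delta U_{i,j}\,\Delta s_{i,j}$, and that uses $A = A^+ + A^-$, which follows from equations (6) and (7).

```latex
\begin{align*}
F_j - F_i &= \Delta F_{i,j}
           = \left(A^+_{i,j} + A^-_{i,j}\right)\Delta U_{i,j}\,\Delta s_{i,j}, \\
F_{i,j}   &= \tfrac{1}{2}(F_i + F_j)
             - \tfrac{1}{2}\left(A^+_{i,j} - A^-_{i,j}\right)\Delta U_{i,j}\,\Delta s_{i,j} \\
          &= F_i + \tfrac{1}{2}\left(A^+_{i,j} + A^-_{i,j}\right)\Delta U_{i,j}\,\Delta s_{i,j}
             - \tfrac{1}{2}\left(A^+_{i,j} - A^-_{i,j}\right)\Delta U_{i,j}\,\Delta s_{i,j} \\
          &= F_i + A^-_{i,j}\,\Delta U_{i,j}\,\Delta s_{i,j}.
\end{align*}
```

Only the negative part of the Jacobian survives, i.e. only the wave components that travel from node j towards node i.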

In order to define a second-order flux, the second part in the right hand side of equation (8), which
contains the positive and negative parts of the flux-difference, is decomposed into components along
the eigenvectors of the Jacobian, according to

$$\Delta F^{\pm}_{i,j} = A^{\pm}_{i,j}\,\Delta U_{i,j}\,\Delta s_{i,j}
  = \sum_n r^n_{i,j}\,\lambda^{n\pm}_{i,j}\,l^n_{i,j}\,\Delta U_{i,j}\,\Delta s_{i,j}, \qquad (10)$$

where $r^n$ and $l^n$ are right and left eigenvectors of A associated to the eigenvalue $\lambda^n$. $r^n$ and $l^n$ are
components of R and L, respectively. By denoting the projection of $\Delta U_{i,j}$ on the n-th eigenvector by

$$\sigma^n_{i,j} = l^n_{i,j}\,\Delta U_{i,j}, \qquad (11)$$

the parts of the flux-difference become

$$\Delta F^{\pm}_{i,j} = \sum_n r^n_{i,j}\,\lambda^{n\pm}_{i,j}\,\sigma^n_{i,j}\,\Delta s_{i,j}
  = \sum_n \Delta F^{n\pm}_{i,j}. \qquad (12)$$

The second-order flux is then defined by

$$F_{i,j} = \tfrac{1}{2}(F_i + F_j)
  - \tfrac{1}{2}\sum_n \left(\Delta F^{n+}_{i,j} - \Delta F^{n-}_{i,j}\right)
  + \tfrac{1}{2}\sum_n \left(\widehat{\Delta F}^{n+}_{i,j} - \widehat{\Delta F}^{n-}_{i,j}\right), \qquad (13)$$

where $\widehat{\Delta F}^{n\pm}$ are limited combinations of the flux-difference components $\Delta F^{n\pm}$ and the shifted
differences $\widetilde{\Delta F}^{n\pm}$, in the sense of i for positive components and in the sense of j for negative components.
The limiter used here is the minmod-limiter. The shifted flux-difference components are constructed
based on shifted differences of conservative variables. These are obtained by extending the line
segment i, j into the adjacent triangles and constructing intersections with the opposing sides (the two
additional points marked in figure 1). The shifted flux-differences are defined by

$$\widetilde{\Delta F}^{n\pm}_{i,j} = r^n_{i,j}\,\lambda^{n\pm}_{i,j}\,\tilde{\sigma}^{n\pm}_{i,j}\,\Delta s_{i,j}
  = r^n_{i,j}\,\lambda^{n\pm}_{i,j}\,l^n_{i,j}\,\widetilde{\Delta U}^{\pm}_{i,j}\,\Delta s_{i,j}, \qquad (14)$$

where $\widetilde{\Delta U}^{\pm}_{i,j}$ denotes the shifted differences. The technique employed to define the second-order flux
is commonly called the flux-extrapolation technique. This concept was introduced by Chakravarthy
and Osher [5]. For examples illustrating the quality of this second-order formulation on structured
grids, the reader is referred to [2, 3, 6].
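
The minmod limiter mentioned above can be sketched as follows. This is the usual two-argument minmod applied to scalar quantities; how exactly the limiter is applied to the vector-valued flux-difference components is not spelled out in the text, so the component-wise helper `limited_components` is an assumption made for illustration.

```python
def minmod(a, b):
    """Two-argument minmod: zero if the signs differ, else the smaller magnitude."""
    if a * b <= 0.0:
        return 0.0
    return a if abs(a) < abs(b) else b

def limited_components(local, shifted):
    """Limit each local wave component against its shifted counterpart.

    `local` and `shifted` are sequences of scalar wave strengths, one per
    eigenvalue index n (hypothetical data layout).
    """
    return [minmod(x, y) for x, y in zip(local, shifted)]

# Hypothetical wave strengths for three components.
limited = limited_components([1.0, -3.0, 1.0], [2.0, -2.0, -1.0])
```

A component is switched off when the local and shifted differences have opposite signs, which is what reduces the scheme to first order near extrema.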


Boundary Conditions 


The examples to follow are either channel flows or flows around airfoils. The internal-type flows 
have solid, inlet, and outlet boundary conditions. The external-type flows have solid boundaries and 
far-field boundaries. Figure 2 gives an example of an external flow. The grid generation is detailed 
in a section below. The far-field boundary is a hexagon with sides 100 chord lengths away from the 
airfoil. 

Inflow and outflow boundary conditions are imposed through the definition of the boundary flux. 
The boundary flux is calculated by equation (8) with the flux Jacobian $A_{i,j}$ determined with the values
of the variables of the boundary node and the difference $\Delta U_{i,j}$ taken as the difference between values
in the boundary node and in a fictitious node. The variables in this fictitious node are calculated
by the classic extrapolation procedure. At a subsonic inlet, the Mach number is extrapolated while 
stagnation conditions and flow direction are imposed. For outflow, the stagnation properties and 
flow direction are extrapolated while the Mach number is imposed. 

At solid boundaries, impermeability is imposed by setting the convective part of the flux equal to
zero. Thus a special flux definition is used for the boundary edges of the control volumes of boundary
nodes. For a node i on the boundary (see figure 1), the flux through the boundary side can be written
as

$$F_{i,j} = F'_i, \qquad (15)$$

where $F'_i$ is the flux calculated with the variables in the node i, and where the convective part of
the flux is set equal to zero. To give the same appearance to a boundary flux as to an interior flux,
all flux expressions can be written as

$$F_{i,j} = F_i + A^-_{i,j}\,\Delta U_{i,j}\,\Delta s_{i,j} + \mathrm{S.O.}, \qquad (16)$$

where S.O. denotes the second-order correction. For a boundary node the point j does not exist. This
can be introduced by taking the values of the variables in the fictitious node j equal to the values of
the variables in the node i. So, the first order difference of the variables $\Delta U_{i,j}$ and the second-order
correction term S.O. in the r.h.s. of equation (16) vanish. The matrix $A^-_{i,j}$ in equation (16) is then
calculated with the values of the variables in the node i. The impermeability is introduced in the
term $F_i$. As will be discussed in the next section, the matrix $A^-_{i,j}$ at a solid boundary plays an
important role in the relaxation method, although it is multiplied with a zero term.

Figure 2: Triangulation around an airfoil with a hexagon far-field boundary 100 chords away from
the airfoil. Because of the large distance between the boundary and the airfoil, the airfoil itself is not
visible in this figure.

For an external-type flow, fluxes through edges on the far-field boundary are defined by equation 
(9). The Jacobian is calculated with the values in the boundary node. The values in the fictitious 
outside node are taken from the uniform far-field flow. By the flux difference splitting this procedure 
is equivalent to the use of Riemann invariants. 

THE MULTI-STAGE RELAXATION METHOD


In our earlier multigrid formulations for steady Euler equations using structured grids, the Gauss- 
Seidel relaxation method was always used [2, 3, 6]. Gauss-Seidel relaxation was preferred to Jacobi- 
relaxation because it has much better smoothing properties (effectiveness of the coarse grid cor- 
rection) and much better speed of convergence (effectiveness of the relaxation method itself). For 
structured grid applications the sequential Gauss-Seidel relaxation is very natural. However, it is 
not simple to use a sequential relaxation method on an unstructured grid, since this requires the 
construction of paths through the grid. On an unstructured grid, a much more natural method is a 
simultaneous method instead of a successive one. A simultaneous relaxation method, like the Jacobi 
relaxation, also has the further advantage of being easily vectorizable and parallelizable. The only 
drawback is that a simultaneous relaxation method, at least in its basic form, is much less effective 
than a sequential method. 

To repair this, we bring multi-staging into the Jacobi method in the same way as multi-staging 
is used for time-stepping methods and we use the optimization results with respect to smoothing 
known for time-stepping schemes. This was first suggested by Morano et al. [7], but not worked out in 
detail. A more detailed analysis on the possible multigrid performance was made by the authors [8] . 

We do not intend here to enter into a fundamental discussion of the possibilities of multi-stage Jacobi
relaxation. We only want to illustrate that it can be used and that it is sufficiently effective, certainly 
on unstructured grids, to be attractive in a multigrid context. We choose here a priori the defect 
correction multigrid procedure as was used in [2, 3, 6]. This means that the second-order part of 
the flux (S.O. in equation (16)) is updated only on the finest grid and is frozen on all other grids. 
We choose this configuration mainly for simplicity. The inner, multigrid, iteration cycle is then 
based on a linear discretization scheme for the partial differential equations. As a consequence of the 
linearity, this cycle can be rigorously optimized. The outer, defect correction, cycle does not involve 
any parameters and does not require optimization. In the outer cycle, a second-order accurate flux 
definition is used. The second-order part of the flux is highly non-linear due to the TVD-limiter. 
It is certainly possible to use second-order operators in the multigrid formulation as shown in [7] 
and [8], but it is not clear how to optimize. Therefore, the difficulties associated with non-linear 
discretization schemes in the multigrid formulation are avoided and only first order discretizations 
are considered. 


For the time-dependent Euler equations, the first order discrete set of equations associated with 
the node i reads 

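$$\mathrm{Vol}_i\,\frac{\partial U_i}{\partial t} + \sum_j A^-_{i,j}\,(U_j - U_i)\,\Delta s_{i,j} = 0, \qquad (17)$$

where the index j loops over the faces of the control volume and the surrounding nodes and where
$\mathrm{Vol}_i$ is the area of the two-dimensional control volume. A single-stage time-stepping method with
local time-stepping applied to equation (17) gives

$$\frac{\mathrm{Vol}_i}{\Delta t_i}\,(U_i^{n+1} - U_i^n) + \sum_j A^-_{i,j}\,(U_j^n - U_i^n)\,\Delta s_{i,j} = 0, \qquad (18)$$

where the superscript n denotes the time level.

Using increments $\delta U_i = U_i^{n+1} - U_i^n$, equation (18) is written as

$$\left(\frac{\mathrm{Vol}_i}{\Delta t_i}\right)\delta U_i + \sum_j A^-_{i,j}\,(U_j^n - U_i^n)\,\Delta s_{i,j} = 0. \qquad (19)$$

The Jacobi relaxation applied to the steady part of equation (17) reads

$$\sum_j A^-_{i,j}\,(U_j^n - U_i^{n+1})\,\Delta s_{i,j} = 0. \qquad (20)$$

Using increments $\delta U_i = U_i^{n+1} - U_i^n$, this gives

$$\Big(-\sum_j A^-_{i,j}\,\Delta s_{i,j}\Big)\,\delta U_i + \sum_j A^-_{i,j}\,(U_j^n - U_i^n)\,\Delta s_{i,j} = 0. \qquad (21)$$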

The 4x4 matrix coefficient of $\delta U_i$ in equation (21) is non-singular. In equations (18) to (21), the
matrices $A^-_{i,j}$ are at the time or relaxation level n. The difference between (single-stage) Jacobi
relaxation equation (21) and single-stage time-stepping equation (18) is seen in the matrix coefficient
of the vector of increments $\delta U_i$. In the time-stepping method, the coefficient is a diagonal matrix. In
the Jacobi method, the matrix is composed of parts of the flux-Jacobians associated with the different 
faces of the control volume. The collected parts correspond to waves incoming to the control volume. 
In the time-stepping, the incoming waves contribute to the increment of the flow variables all with 
the same weight factor. In the Jacobi relaxation the corresponding weight factors are proportional 
to the wave speeds. As a consequence, Jacobi relaxation can be seen as a time-stepping in which all 
incoming wave components are scaled to have the same effective speed. This is very important for 
the optimization in the sequel. 


For a node on a solid boundary, an expression similar to equation (21) is obtained provided that
for a face on the boundary the flux expression of equation (16) is used and that the difference in the
first order flux-difference part is introduced as $U_i^n - U_i^{n+1}$ (similar to the term $U_j^n - U_i^{n+1}$ which is
used for a flux on an interior face). This special treatment at boundaries is necessary in order to avoid
a singular matrix coefficient of the vector of increments in equation (21). A boundary node can
then be updated precisely in the same way as a node in the interior.

To bring in multi-staging is now very simple. For instance, a three-stage Jacobi relaxation is given
by

$$U_1 = U_0 + \alpha_1\,\delta U_0, \qquad U_2 = U_0 + \alpha_2\,\delta U_1, \qquad U_3 = U_0 + \alpha_3\,\delta U_2,$$

with

$$U_0 = U^n, \qquad U^{n+1} = U_3.$$

The parameters $\alpha_1$, $\alpha_2$ and $\alpha_3$ are to be chosen. The increment $\delta U_l$ (l = 0, 1, 2) is obtained from the
single-stage Jacobi relaxation method. The method is converted to a multi-stage time-stepping if $\delta U_l$ is
replaced by the increment obtained from the single-stage time-stepping method with CFL-number
equal to 1. In the sequel, we use the optimization results obtained by Van Leer et al. [9]. For
three-stage relaxation, the parameters are $\alpha_1 = 0.1481\,\alpha_3$, $\alpha_2 = 0.40\,\alpha_3$ and $\alpha_3 = 1.5$. Since by the Jacobi
relaxation all wave components are scaled to have the same speed, the tuning of the smoothing is
correct for all wave components.
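
The structure of a three-stage Jacobi relaxation can be sketched on a small linear system. This is an illustration under stated assumptions, not the Euler solver of the paper: every stage is assumed to restart from the initial state and add a scaled single-stage Jacobi increment evaluated at the previous stage (the usual multi-stage form), the matrix and right-hand side are hypothetical, and scalar unknowns replace the 4x4 blocks. The stage coefficients are the three-stage values quoted in the text.

```python
ALPHA3 = 1.5
ALPHAS = (0.1481 * ALPHA3, 0.40 * ALPHA3, ALPHA3)   # alpha_1, alpha_2, alpha_3

def jacobi_increment(A, b, x):
    """Single-stage Jacobi increment: (b_i - sum_j A_ij x_j) / A_ii."""
    n = len(b)
    return [(b[i] - sum(A[i][j] * x[j] for j in range(n))) / A[i][i]
            for i in range(n)]

def three_stage_jacobi(A, b, x0, alphas=ALPHAS):
    """One multi-stage sweep; each stage starts again from x0 (assumed form)."""
    x = list(x0)
    for alpha in alphas:
        dx = jacobi_increment(A, b, x)          # increment at the latest stage
        x = [x0_i + alpha * dx_i for x0_i, dx_i in zip(x0, dx)]
    return x

# Hypothetical diagonally dominant system with exact solution (1, 1, 1).
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [3.0, 2.0, 3.0]
x = [0.0, 0.0, 0.0]
for _ in range(40):
    x = three_stage_jacobi(A, b, x)
```

Because all stages are simultaneous updates, each stage is as easy to vectorize or parallelize as plain Jacobi relaxation.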

To illustrate the performance of multi-stage Jacobi relaxation, we consider a test problem in which 
no adaption is used. Figure 3 shows the test geometry discretized with two unstructured grids with 
10557 and 2657 nodes. The grids have a more or less uniform distribution of the mesh size. Two 
coarser grids (not shown) with 683 and 182 nodes are also used. The bump has a height of 4% of its 
chord. The channel height is equal to the bump chord. 

Figure 4 shows the iso-Mach line results obtained for a transonic case with an outlet Mach number 
of 0.79 and for a supersonic case with inlet Mach number of 1.4.

Figure 5 shows the convergence history. A full multigrid method with W-cycles is employed. 
The calculation starts from uniform flow on the coarsest grid. On each level one three-stage Jacobi 
relaxation is done. One Jacobi stage and one residual evaluation for all nodes on the finest grid are
both counted as one work unit. The corresponding work on a coarse grid is counted proportional to
the number of nodes on that grid. A second-order correction is also counted as one work unit. The
work associated with intergrid transfer is neglected. Coarse grid points also appear in the finer grids,
so injection is used for function value restriction. Defect restriction is obtained through weighting
by

$$\frac{R_{i'}}{\mathrm{Vol}_{i'}} = \frac{1}{2}\left(\frac{R_i}{\mathrm{Vol}_i} + \frac{1}{n}\sum_j \frac{R_j}{\mathrm{Vol}_j}\right),$$

where the node i on the finer grid corresponds with the node i' on the coarser grid and where j loops 
over the n neighboring nodes of i. Ri stands for the residual as given by equation (22) and Voli is 
the volume of the control volume. The prolongation is constructed by direct transfer of the coarse 
grid correction to the corresponding points in the finer grid. Fine grid points that do not correspond 
to a coarse grid point are given a correction value based on an average of the corrections at the 
neighboring nodes that do appear on the coarser grid. 
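
The two transfer operators can be sketched on a node-and-neighbor data structure. This is a hedged illustration: the restriction weighting is assumed to be one half of the volume-scaled residual at the node itself plus one half of the mean volume-scaled residual of its n neighbors, and the function names, the dictionary layout and the one-dimensional chain of nodes are assumptions made for the example.

```python
def restrict_defect(i, residual, volume, neighbors):
    """Volume-scaled defect for the coarse node that coincides with fine node i."""
    nbrs = neighbors[i]
    own = residual[i] / volume[i]
    mean_nbr = sum(residual[j] / volume[j] for j in nbrs) / len(nbrs)
    return 0.5 * (own + mean_nbr)

def prolong(coarse_correction, fine_to_coarse, neighbors):
    """Transfer corrections to the fine grid.

    Shared nodes receive the coarse value directly; other nodes receive the
    average over those neighbors that also exist on the coarse grid. The
    sketch assumes every such node has at least one coarse neighbor.
    """
    fine = {}
    for i, nbrs in neighbors.items():
        if i in fine_to_coarse:
            fine[i] = coarse_correction[fine_to_coarse[i]]
        else:
            vals = [coarse_correction[fine_to_coarse[j]]
                    for j in nbrs if j in fine_to_coarse]
            fine[i] = sum(vals) / len(vals)
    return fine

# Hypothetical chain of five fine nodes; nodes 0, 2, 4 also exist on the coarse grid.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
fine_to_coarse = {0: 0, 2: 1, 4: 2}
residual = [1.0, 2.0, 3.0, 4.0, 5.0]
volume = [1.0, 2.0, 1.0, 2.0, 1.0]
restricted = restrict_defect(2, residual, volume, neighbors)
correction = prolong([10.0, 20.0, 30.0], fine_to_coarse, neighbors)
```

In this example the restricted value at fine node 2 is 0.5 * (3.0 + 1.5) = 2.25, and the two nodes that are absent from the coarse grid receive the averages 15.0 and 25.0.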

The convergence history of the maximum residual is shown in figure 5. The convergence rate is 
about one order of magnitude per 250 work units. This is an acceptable performance. The best
convergence rate on structured grids using Gauss-Seidel relaxation on comparable problems is about
one order of magnitude per 50 work units [2, 3, 6, 10]. The convergence rate in the structured case
is better because a Gauss-Seidel type smoother is used.

THE MESH GENERATION 

The automatic triangulation of an arbitrary set of points can be achieved using Delaunay trian- 
gulation. Robust algorithms to construct this triangulation in 2D are available. 

Figure 5: Convergence behaviour for the second-order defect correction scheme (TVD minmod-limiter)
using three-stage Jacobi relaxation. Transonic (T) and supersonic (S) test cases.

The grid generator is built with two basic algorithms. The first one is based on the advancing 
front method, developed by Tanemura et al. [11], simplified to work on a set of nodes which are 
already connected. The nodes have to lie on the boundary of a polygon. With this algorithm a
given, not necessarily convex, polygon can be triangulated. No extra nodes are added. The result 
is a constrained Delaunay triangulation of the given polygonal boundary. This algorithm was used 
for the generation of the initial triangulation of the domain. The domain was defined by discretizing 
the boundaries based on local curvature and local grid spacing. 

The second basic algorithm used in the grid generation allows the addition of a node to an existing 
triangulation. The new node is connected to three or four existing nodes: three nodes if the new 
node lies inside a triangle, four nodes if the new node lies on an edge. The triangulation is made 
Delaunay by the use of a diagonal swapping algorithm [12].
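
Diagonal swapping is usually driven by the in-circle test: an edge shared by two triangles is swapped when the vertex opposite to it in one triangle lies inside the circumcircle of the other. The sketch below shows that test in the untransformed plane. It is a generic illustration, not the algorithm of [12]: the function names are assumptions, and plain floating-point arithmetic is used without the robustness measures a production mesh generator would need.

```python
def orient(a, b, c):
    """Twice the signed area of triangle abc (positive if counter-clockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def in_circumcircle(a, b, c, d):
    """True if point d lies strictly inside the circumcircle of triangle abc."""
    (ax, ay), (bx, by), (cx, cy) = [(p[0] - d[0], p[1] - d[1]) for p in (a, b, c)]
    det = ((ax * ax + ay * ay) * (bx * cy - cx * by)
           - (bx * bx + by * by) * (ax * cy - cx * ay)
           + (cx * cx + cy * cy) * (ax * by - bx * ay))
    # Multiplying by the orientation makes the test independent of vertex order.
    return det * orient(a, b, c) > 0.0

def should_swap(p, q, r, s):
    """Edge pq is shared by triangles pqr and pqs; swap it to rs if s violates
    the empty-circumcircle property of triangle pqr."""
    return in_circumcircle(p, q, r, s)

# Hypothetical quadrilateral: edge (1,0)-(0,1) with opposite vertices (0,0) and s.
swap_near = should_swap((1.0, 0.0), (0.0, 1.0), (0.0, 0.0), (0.9, 0.9))
swap_far = should_swap((1.0, 0.0), (0.0, 1.0), (0.0, 0.0), (2.0, 2.0))
```

Here the nearby vertex (0.9, 0.9) lies inside the circumcircle and triggers a swap, while the distant vertex (2.0, 2.0) does not.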

For a sequence of multigrid grids, our formulation of the intergrid transfer operators demands 
that a node of a coarse grid appears in the finer grids. Two strategies can be used to satisfy this 
condition. A sequence of grids can be constructed by refining a coarse grid (adding nodes), or by 
coarsening a fine grid (deleting nodes). To add a node to a triangulation the second basic algorithm 
is used. To delete a node, the node (together with all the connected edges) is removed from the 
triangulation and then the remaining cavity is retriangulated using the first basic algorithm. 

To allow stretching in the construction of the grid, the above described Delaunay triangulation is
performed in a transformed space. Every node has its own transformation parameters $(\theta, s_{x'}, s_{y'})$.
The transformation is a rotation over the angle $\theta$, followed by a rescaling of the new x' and y' axes
by $s_{x'}$ and $s_{y'}$. The criteria by which the Delaunay triangulation is built are based on transformed
properties.

For the first test case, the bump in a channel of figure 3, multiple grids are built from coarse 
to fine by adding nodes. An initial grid is generated based on the discretization of the boundary. 
Transformation parameters are calculated from boundary mesh spacing and boundary edge direction.
The initial grid is refined to form the coarsest grid in the computation. Interior nodes are added in 
the middle of selected edges, until all edges satisfy an edge length criterion in the locally transformed 
space. Finer meshes are generated by taking a smaller value of the threshold on the edge length. 

The final mesh is smoothed, moving all the nodes of all the grids. It is possible that by this 
smoothing the triangulation is no longer Delaunay. Therefore, an edge swapping algorithm is used 
to restore the Delaunay property on all the grids. 

For the second test case, the NACA-0012 airfoil, the first multigrid sequence is built in the same 
way. Then a solution is determined. A new mesh is now generated based on flow adaptive refinement, 
in which three refinement criteria are used. The first criterion is based on the pressure difference
over an edge. If $|p_i - p_j|\,L_{i,j} > L_{ref}\,|p_{max} - p_{min}|/C_p$, then the edge i,j is refined by placing a new
node at the center of the edge. In this formulation $p_i$ is the pressure at node i and $L_{i,j}$ is the length
of the edge. The variables on the right hand side are minimum or maximum values taken over all
the nodes. $L_{ref}$ is a problem dependent reference length and $C_p$ is the sensitivity constant for the
pressure criterion. This criterion is triggered by shock waves and stagnation regions. The second criterion is
based on the entropy difference over an edge. It is similar to the first criterion, but p is replaced
by the entropy s. This criterion is triggered by shock waves and tangential discontinuities. Further, edges
are refined where the flow passes through Mach number equal to 1 from supersonic to subsonic flow.
This criterion is also triggered by shock waves.
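
The pressure criterion can be sketched as an edge loop. In this minimal illustration an edge is flagged when the pressure jump over the edge times the edge length exceeds the reference length times the global pressure range divided by the sensitivity constant. The function name, the argument layout and the sample data are assumptions; the entropy and sonic-line criteria are not included.

```python
import math

def edges_to_refine(coords, pressure, edges, l_ref, c_p):
    """Return the edges (i, j) that satisfy the pressure refinement criterion."""
    threshold = l_ref * abs(max(pressure) - min(pressure)) / c_p
    flagged = []
    for i, j in edges:
        length = math.hypot(coords[i][0] - coords[j][0],
                            coords[i][1] - coords[j][1])
        if abs(pressure[i] - pressure[j]) * length > threshold:
            flagged.append((i, j))
    return flagged

def edge_midpoint(coords, edge):
    """Location of the new node placed at the center of a flagged edge."""
    i, j = edge
    return (0.5 * (coords[i][0] + coords[j][0]),
            0.5 * (coords[i][1] + coords[j][1]))

# Hypothetical data: a pressure jump between nodes 1 and 2.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
pressure = [1.0, 1.0, 2.0, 2.1]
edges = [(0, 1), (1, 2), (2, 3)]
flagged = edges_to_refine(coords, pressure, edges, l_ref=1.0, c_p=2.0)
new_nodes = [edge_midpoint(coords, e) for e in flagged]
```

Because the jump is weighted by the edge length, an edge that has already been refined several times across a shock eventually stops being flagged.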

The adaptively refined mesh can be included directly in the multigrid sequence but large parts of 
the mesh might coincide with the next coarser mesh. This results in a degradation of the multigrid 
performance. Therefore, it is better to generate coarser meshes by coarsening the new mesh. The 
simple criterion, removing a node only if no neighbors of this node are removed, is used. The number 
of nodes is decreased by roughly 1/4 by doing this. This coarsening is repeated twice to get a coarser 
mesh. 
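
The coarsening rule, removing a node only if none of its neighbors is removed, can be sketched as a greedy pass over the nodes. The visiting order and the optional set of protected nodes (for instance nodes that must stay on the boundary) are assumptions made for this illustration; the retriangulation of the cavities left by the removed nodes is not shown.

```python
def select_nodes_to_remove(neighbors, protected=()):
    """Greedy selection: a node is marked for removal only if none of its
    neighbors has already been marked."""
    removed = set()
    for node in sorted(neighbors):
        if node in protected:
            continue
        if not any(nb in removed for nb in neighbors[node]):
            removed.add(node)
    return removed

# Hypothetical chain of five nodes.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
removed_all = select_nodes_to_remove(neighbors)
removed_protected = select_nodes_to_remove(neighbors, protected={0})
```

The selected nodes form an independent set of the mesh graph, so every removed node keeps all of its neighbors and each cavity can be retriangulated locally.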


MULTIGRID RESULTS USING ADAPTIVITY 


Starting from the grid shown in figure 2, two successive stages of refinement were done using a 
threshold value on the edge length in the locally transformed space. Figure 6 shows part of the final 
grid generated this way. On this grid a solution is calculated using multigrid. By the use of the 
foregoing adaption criteria, this mesh is refined. Three coarser grids are generated from this grid 
with the above described coarsening procedure. A solution is calculated using the four grids. The 
adaption is repeated three times. The solution on the final mesh is given in figure 7 in the form of
iso-Mach lines. The NACA-0012 profile has an angle of attack of 1.25 degrees and the Mach number
of the incoming flow is 0.80. In total, 4 fine meshes (each with 2 or 3 coarser meshes)
were used, with 1302 (figure 6) (594, 300), 2243 (1256, 720, 421), 2834 (1577, 911, 522) and 3236 (figure 7)
(1782, 1010, 580) nodes. Figure 8 shows the final grid structure in the shock regions. Figure 9 shows
the final grid structure in the leading edge and the trailing edge regions. The refinement of the fourth
grid is shown in the left part of figure 10, where only the edges selected for refinement are shown.
The right part of figure 10 shows the pressure distribution over the airfoil.

Figure 10: Left: Edges selected for refinement from third to fourth grid; Right: Pressure distribution
over the airfoil.




Figure 11: Convergence behaviour for the second-order defect correction scheme (TVD minmod
limiter) on the NACA-0012 test case, using three-stage Jacobi. Logarithm of the residual as function
of local work units. Multigrid with fine grids 1302 (a), 2243 (b), 2834 (c), and 3236 (d) nodes.

Figure 11 shows the convergence behaviour. For each multigrid phase, the work unit is defined
based on the number of nodes in the finest grid during this phase. This way of representing the
convergence is chosen to demonstrate the mesh independence of the convergence.

SUMMARY


Two-level domain decomposition methods are developed for a simple nonconforming 
approximation of second order elliptic problems. A bound is established for the condition number 
of these iterative methods, which grows only logarithmically with the number of degrees of freedom 
in each subregion. This bound holds for two and three dimensions and is independent of jumps in 
the value of the coefficients. 


INTRODUCTION

The purpose of this paper is to develop domain decomposition methods for second order elliptic 
partial differential equations approximated by a simple nonconforming finite element method, the 
nonconforming $P_1$ elements. We consider a variant of a two-level additive Schwarz method
introduced in 1987 by Dryja and Widlund [1] for a conforming case. In these methods, a 
preconditioner is constructed from the restriction of the given elliptic problem to overlapping 
subregions into which the given region has been decomposed. In addition, in order to enhance the 
convergence rate, the preconditioner includes a coarse mesh component of relatively modest 
dimension. The construction of this component is the most interesting part of the work. Here we 
have been able to draw on earlier multilevel studies, cf. Brenner [2], Oswald [3], as well as on recent
work by Dryja, Smith, and Widlund [4]. Our main result shows that the condition number of our
iterative methods is bounded by $C(1 + \log(H/h))$, where H and h are the mesh sizes of the global
and local problems, respectively. We also note that this bound is independent of the variations of 
the coefficients across the subregion interfaces. 

The face based and the Neumann-Neumann coarse spaces that we introduce have the
following characteristics. The nodal values are constant on each edge (or face) of the subregions 
and the values at the other nodes are given by a simple but nonstandard interpolation formula. 
Thus the value at any node in the interior of a subregion is a convex combination of three (or four) 
values given on the boundary, in case of triangular (or tetrahedral) substructures. We note that an
important difference between the nonconforming and the conforming case is that there are no nodes at
the vertices (or wire basket) of the subregions.

*This work was supported by a graduate student fellowship from Conselho Nacional de Desenvolvimento Cientifico
U. S. Department of Energy under contract DE-FG02-92ER25127.

We note that ideas similar to ours have been used recently in other studies of domain 
decomposition methods for nonconforming elements; cf. Cowsar [5,6] and Cowsar, Mandel and 
Wheeler [7]. In particular, an isomorphism similar to ours was independently introduced by 
Cowsar. We point out that by using these isomorphisms, we can analyze any nonconforming 
version of domain decomposition methods which have already been analyzed for conforming cases. 
In this paper, we focus on the case where there are great variations in the coefficients across 
subdomain boundaries for both two and three dimensions. We define and analyze new coarse
spaces and obtain condition numbers with just one log factor. 

A short version of this paper was entered into the Copper Mountain student competition in
mid-December 1992. The present paper is a slight modification of a technical report [8]. 

DIFFERENTIAL AND FINITE ELEMENT MODEL PROBLEMS 


To simplify the presentation, we assume that Q is an open, bounded, polygonal region of 
diameter 1 in the plane, with boundary dCl. In a separate section, we extend all our results to the 
three dimensional case. 

We introduce a partition of $\Omega$ as follows. In a first step, we divide the region $\Omega$ into 
nonoverlapping triangular substructures $\Omega_i$, $i = 1, \dots, N$. Adopting common assumptions in finite 
element theory, cf. Ciarlet [9], all substructures are assumed to be shape regular, quasi uniform and to 
have no dead points; i.e., each interior edge is the intersection of the boundaries of two triangular 
regions. We can show that the theory also holds if we choose nontriangular substructures, where 
the boundary of each substructure is a composition of several curved edges, and each curved edge is 
the intersection of two substructures. Naturally, we need assumptions related to the quasi 
uniformity and nondegeneracy of this partition. Initially, we restrict our exposition to the case of 
triangular substructures since the main ideas are seen in this case. This partition induces a coarse 
mesh and we introduce a mesh parameter $H := \max\{H_1, \dots, H_N\}$, where $H_i$ is the diameter of $\Omega_i$. 

We denote this triangulation by $\mathcal{T}^H$. Later, we extend the results to nontriangular substructures. 

In a second step, we obtain the elements by subdividing the substructures into triangles in such 
a way that they are shape regular and quasi uniform. We define a mesh parameter h as the 
diameter of the smallest element and denote this triangulation by $\mathcal{T}^h$. Similarly, we assume the 
triangulation $\mathcal{T}^h$ does not have any dead points. 

We study the following selfadjoint second order elliptic problem: 

Find $u \in H_0^1(\Omega)$, such that 

$$a(u,v) = f(v), \quad \forall\, v \in H_0^1(\Omega), \tag{1}$$




where 

$$a(u,v) = \int_\Omega a(x)\, \nabla u \cdot \nabla v \, dx \quad \text{and} \quad f(v) = \int_\Omega f v \, dx \quad \text{for } f \in L^2(\Omega).$$

We assume that $a(x) > \alpha > 0$ and that it is a piecewise constant function with jumps occurring 
only across the substructure boundaries. This includes cases where there is a great variation in the 
value of the coefficient a(x). We remark that there is no difficulty in extending the analysis and the 
results to the case where a(x) does not vary greatly inside each substructure. 

Definition 1 The nonconforming $P_1$ element spaces (cf. Crouzeix and Raviart [10]) on the h-mesh 
and H-mesh are given by 

$V^h := \{v \mid v$ linear in each triangle $T \in \mathcal{T}^h$, 
$v$ continuous at the midpoints of the edges of $\mathcal{T}^h$, and 
$v = 0$ at the midpoints of edges of $\mathcal{T}^h$ that belong to $\partial\Omega\}$, 

and 

$V^H := \{v \mid v$ linear in each triangle $T \in \mathcal{T}^H$, 
$v$ continuous at the midpoints of the edges of $\mathcal{T}^H$, and 
$v = 0$ at the midpoints of edges of $\mathcal{T}^H$ that belong to $\partial\Omega\}$. 

These spaces are nonconforming; in fact $V^H \not\subset V^h$ and $V^h \not\subset H_0^1(\Omega)$. 

Let E be a region contained in $\Omega$ such that $\partial E$ does not cut through any element. Denote by $V^h|_E$ 
and $\mathcal{T}^h|_E$ the space $V^h$ and the triangulation $\mathcal{T}^h$ restricted to E, respectively. 

Given $u \in V^h|_E$, we define the discrete weighted energy semi norm by: 

$$|u|^2_{H^1_{a,h}(E)} := a_E^h(u,u), \tag{2}$$

where 

$$a_E^h(u,u) = \sum_{T \in \mathcal{T}^h|_E} \int_T a(x)\, \nabla u \cdot \nabla u \, dx. \tag{3}$$

In a similar fashion, we define the inner product $a_E^H(u,v)$ and the semi norm $|\cdot|_{H^1_{a,H}(E)}$ for 
$u, v \in V^H|_E$. In order not to use unnecessary notation, we drop the subscript $\Omega$ when the 
integration is over $\Omega$ and the subscript $a$ when $a = 1$. 


The discrete problem associated with (1) is given by: 

Find $u \in V^h$, such that 

$$a^h(u,v) = f(v), \quad \forall\, v \in V^h(\Omega). \tag{4}$$

Note that $|\cdot|_{H^1_{a,h}(\Omega)}$ is a norm, because if $|u|_{H^1_{a,h}(\Omega)} = 0$ then u is constant in each element. By 
the continuity at the midpoints of the edges and the zero boundary conditions, we obtain $u = 0$. 
Note also that $f$ is a continuous linear form. Therefore, we can apply the Lax-Milgram theorem 
and find that there exists one and only one solution of the discrete equation (4). 


We also define the weighted $L^2$ norm by: 

$$\|u\|^2_{L^2_a(E)} := \int_E a(x)\, |u(x)|^2 \, dx \quad \text{for } u \in (V^h + V^H + L^2_a)|_E. \tag{5}$$
We introduce the following notation: $x \preceq y$, $f \succeq g$ and $u \asymp v$ meaning 

$x \le C y$, $f \ge c g$ and $c v \le u \le C v$, respectively. 

Here C and c are positive constants independent of the variables appearing in the inequalities and 
the parameters related to meshes, spaces and, especially, the weight a(x). 



Sometimes it is more convenient to evaluate a norm of a finite element function in terms of the 
values of this function at the nodal points. By first working on a reference element and then using 
the assumption that the elements are shape regular, we obtain the following lemma: 

Lemma 1 For $u \in V^h|_E$, 

$$\|u\|^2_{L^2_a(E)} \asymp h^2 \sum_{T \in \mathcal{T}^h|_E} a(T) \left( u^2(M_1) + u^2(M_2) + u^2(M_3) \right) \tag{6}$$

and 

$$|u|^2_{H^1_{a,h}(E)} \asymp \sum_{T \in \mathcal{T}^h|_E} a(T) \left\{ (u(M_1) - u(M_2))^2 + (u(M_2) - u(M_3))^2 + (u(M_3) - u(M_1))^2 \right\}, \tag{7}$$

where $M_1, M_2, M_3$ are the midpoints of the edges of the triangle T as in Figure 1. 
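
The equivalence between the energy and the midpoint differences can be checked numerically on a single triangle. The following is a minimal sketch, assuming a unit coefficient a(T) = 1 and the usual nonconforming P1 basis functions 1 - 2*lambda_i (lambda_i being the barycentric coordinates); the function name `cr_energy` is hypothetical and is not taken from the paper.

```python
import numpy as np

def cr_energy(verts, m_vals):
    """Energy int_T |grad u|^2 of the linear function u on the triangle T whose
    value at the midpoint of the edge opposite vertex i is m_vals[i]."""
    verts = np.asarray(verts, dtype=float)
    m_vals = np.asarray(m_vals, dtype=float)
    # Rows of A are (1, x_j, y_j); column i of inv(A) holds the coefficients
    # (c0, cx, cy) of the barycentric coordinate lambda_i.
    A = np.hstack([np.ones((3, 1)), verts])
    grad_lam = np.linalg.inv(A)[1:, :].T      # row i = grad(lambda_i)
    area = 0.5 * abs(np.linalg.det(A))
    # u = sum_i m_i (1 - 2 lambda_i), hence grad u = -2 sum_i m_i grad(lambda_i)
    grad_u = -2.0 * m_vals @ grad_lam
    return area * grad_u @ grad_u
```

For an equilateral triangle the ratio between this energy and the sum of the three squared midpoint differences equals 2/sqrt(3), and in two dimensions the energy does not change when the triangle is scaled, which is consistent with the absence of a power of h in the energy estimate.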


An inverse inequality can be obtained by using only local properties. It is easy to see that for 
$u \in V^h$, 

$$|u|_{H^1_{a,h}} \preceq h^{-1} \|u\|_{L^2_a}. \tag{8}$$


ADDITIVE SCHWARZ SCHEMES 


We now describe the special additive Schwarz method introduced by Dryja and Widlund; see 
e.g. [11,12]. In this method, we cover $\Omega$ by overlapping subregions obtained by extending each 
substructure $\Omega_i$ to a larger region $\Omega_i'$. We assume that the overlap is $\delta_i$, where $\delta_i$ is the distance 
between the boundaries $\partial\Omega_i$ and $\partial\Omega_i'$, and we denote by $\delta$ the minimum of the $\delta_i$. We also assume 
that $\partial\Omega_i'$ does not cut through any element. We make the same construction for the substructures 
that meet the boundary except that we cut off the part of $\Omega_i'$ that is outside of $\Omega$. 

For each $\Omega_i'$, a $P_1$ nonconforming finite element subdivision is inherited from the h-mesh 
subdivision of $\Omega$. The corresponding finite element space is defined by 

$$V_i^h := \{v \mid v \in V^h,\ \text{support of } v \subset \Omega_i'\}, \quad i = 1, \dots, N. \tag{9}$$

The coarse space $V_0^h \subset V^h(\Omega)$ is given as the range of a prolongation operator $I_H^h$ (or of an 
alternative prolongation operator); both will be defined later. 

Our finite element space is represented as a sum of $N + 1$ subspaces 

$$V^h = V_0^h + V_1^h + \cdots + V_N^h. \tag{10}$$

We introduce operators $P_i : V^h \to V_i^h$, $i = 0, \dots, N$, by 

$$a^h(P_i w, v) = a^h(w, v), \quad \forall\, v \in V_i^h, \tag{11}$$

and the operator $P : V^h \to V^h$, by 

$$P = P_0 + P_1 + \cdots + P_N. \tag{12}$$

In matrix notation, $P_0$ is given by 

$$P_0 = I_H^h \left( (I_H^h)^T K I_H^h \right)^{-1} (I_H^h)^T K, \tag{13}$$

where K is the global stiffness matrix associated with $a^h(\cdot,\cdot)$. 


We replace the problem (4) by 

$$P u = g, \quad g = \sum_{i=0}^{N} g_i, \quad \text{where } g_i = P_i u. \tag{14}$$

By construction, (4) and (14) have the same solution. We point out that $g_i$ can be computed 
without knowledge of u, since we can find $g_i$ by solving 

$$a^h(g_i, v) = a^h(u, v) = f(v), \quad \forall\, v \in V_i^h. \tag{15}$$

The operator P is positive definite and symmetric with respect to $a^h(\cdot,\cdot)$. We can therefore solve 
(14) by a conjugate gradient method. In order to estimate the rate of convergence, we need to 
obtain upper and lower bounds for the spectrum of P. A lower bound is obtained by using the 
following lemma; cf. Zhang [13,14]. 

Lemma 2 Let $P_i$ be the operators defined in equation (11) and let P be given by (12). Then 

$$a^h(P^{-1} v, v) = \min_{v = \sum v_i} \sum_{i=0}^{N} a^h(v_i, v_i), \quad v_i \in V_i^h. \tag{16}$$

Therefore, if a representation $v = \sum v_i$ can be found such that 

$$\sum_{i=0}^{N} a^h(v_i, v_i) \le C_0^2\, a^h(v, v), \quad \forall\, v \in V^h, \tag{17}$$

then 

$$\lambda_{\min}(P) \ge C_0^{-2}.$$
An upper bound on the spectrum is obtained by bounding 

$$a^h(Pv, v) = a^h(P_0 v, v) + a^h(P_1 v, v) + \cdots + a^h(P_N v, v) \tag{18}$$

from above in terms of $a^h(v,v)$. Using Schwarz's inequality, the fact that the $P_i$ are projections, 
and that the maximum number of regions that intersect at any point is uniformly bounded, it is 
easy to show that the spectrum of P is bounded above by 

$$\max_{p \in \Omega} \{\#(i : p \in \Omega_i') + 1\}.$$
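
A small algebraic illustration of the operator P and of its spectral bounds may help here. The sketch below is not the nonconforming method of this paper: it assumes a one dimensional model stiffness matrix, three overlapping index blocks playing the role of the local spaces, and piecewise linear hat functions as a stand-in coarse space. All names (`schwarz_preconditioner`, `blocks`, `prolong`) are hypothetical.

```python
import numpy as np

def schwarz_preconditioner(K, blocks, prolong):
    """Return M^{-1} = I0 (I0^T K I0)^{-1} I0^T + sum_i R_i^T K_i^{-1} R_i,
    so that the additive Schwarz operator is P = M^{-1} K."""
    # Coarse contribution, cf. the matrix form of P_0.
    Minv = prolong @ np.linalg.solve(prolong.T @ K @ prolong, prolong.T)
    # Local contributions: invert the diagonal block of K for each subdomain.
    for idx in blocks:
        Minv[np.ix_(idx, idx)] += np.linalg.inv(K[np.ix_(idx, idx)])
    return Minv

# Assumed toy problem: 1D Laplacian with 30 interior nodes.
n = 30
K = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
# Three overlapping blocks; every node lies in at most two of them.
blocks = [np.arange(0, 12), np.arange(8, 22), np.arange(18, 30)]
# Coarse space: two hat functions centered at 1/3 and 2/3.
x = np.arange(1, n + 1) / (n + 1)
coarse_nodes = np.array([1.0 / 3.0, 2.0 / 3.0])
prolong = np.maximum(0.0, 1.0 - 3.0 * np.abs(x[:, None] - coarse_nodes[None, :]))

Minv = schwarz_preconditioner(K, blocks, prolong)
# P = M^{-1} K is similar to the symmetric matrix L^T M^{-1} L with K = L L^T.
L = np.linalg.cholesky(K)
eigs = np.linalg.eigvalsh(L.T @ Minv @ L)

# The transformed system P u = g has the same solution as K u = f.
f = np.ones(n)
u_direct = np.linalg.solve(K, f)
u_schwarz = np.linalg.solve(Minv @ K, Minv @ f)
```

In this toy setting the first and third blocks are not coupled by K, so the largest eigenvalue in `eigs` cannot exceed 2 + 1 = 3, which matches the bound stated above; the smallest eigenvalue is positive, and its size is what the lower bound of Lemma 2 controls.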


PROPERTIES OF THE $P_1$ NONCONFORMING FINITE ELEMENT SPACE 


We first define two local equivalence maps in order to obtain some inequalities and local 
properties for our nonconforming space. Through these mappings, we can extend some results that 
are known for the piecewise linear conforming elements to our nonconforming case. 

We use a bar to denote conforming spaces. Let $\bar V^{h/2}|_{\Omega_i}$ be the conforming space of piecewise linear 
functions in $\Omega_i$, where the h/2-mesh is obtained by joining midpoints of the edges of elements of 
$\mathcal{T}^h|_{\Omega_i}$. 

We define the local equivalence map $\mathcal{M}_i : V^h|_{\Omega_i} \to \bar V^{h/2}|_{\Omega_i}$, as follows: 

Isomorphism 1 Given $u \in V^h|_{\Omega_i}$, define $\bar u = \mathcal{M}_i u$ by the values of $\bar u$ at the three sets of points (cf. 
Figure 2): 

i) If P is a midpoint of an edge of a triangle in $\mathcal{T}^h$, then 

$\bar u(P) := u(P)$. 

ii) If P is a vertex of an element in $\mathcal{T}^h$ and belongs to the interior of $\Omega_i$, and the $T_j$ are 
the elements that have P as a vertex, then 

$\bar u(P) :=$ mean of $u|_{T_j}(P)$. 

Here $u|_{T_j}(P)$ is the limit value of $u(x)$ when $x \in T_j$ approaches P. 

iii) If Q is a vertex of $\mathcal{T}^h|_{\partial\Omega_i}$, and $Q_l$ and $Q_r$ are the two midpoints of $\mathcal{T}^h|_{\partial\Omega_i}$ that are next 
neighbors of Q, then 

$$\bar u(Q) := \frac{|Q_l Q|}{|Q_l Q_r|}\, u(Q_l) + \frac{|Q_r Q|}{|Q_l Q_r|}\, u(Q_r).$$

Here $|Q_r Q|$ is the length of the segment $Q_r Q$. 

Case ii) is illustrated in Figure 2. 

Case iii) is required in order to have property (21), which will be very important in our analysis. 
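
The role of the boundary rule can be illustrated with a short computation. The sketch below assumes a closed boundary made of straight fine edges, weights each neighboring midpoint value by the length of its own half edge, and measures the distance between the two neighboring midpoints along the boundary; under these assumptions the integral of the conforming image equals the integral of the nonconforming function. The function names are hypothetical.

```python
import numpy as np

def boundary_vertex_values(edge_len, mid_vals):
    """Values at the boundary vertices; vertex k lies between edge k-1 (left)
    and edge k (right) on a closed loop of fine edges."""
    edge_len = np.asarray(edge_len, dtype=float)
    mid_vals = np.asarray(mid_vals, dtype=float)
    left_len, left_val = np.roll(edge_len, 1), np.roll(mid_vals, 1)
    # Half-edge lengths are left_len / 2 and edge_len / 2; the factor 1/2 cancels.
    return (left_len * left_val + edge_len * mid_vals) / (left_len + edge_len)

def integral_nonconforming(edge_len, mid_vals):
    """A linear function on an edge integrates to length times midpoint value."""
    return float(np.dot(edge_len, mid_vals))

def integral_conforming(edge_len, mid_vals):
    """Integral of the continuous piecewise linear function on the half-edge mesh."""
    edge_len = np.asarray(edge_len, dtype=float)
    mid_vals = np.asarray(mid_vals, dtype=float)
    v = boundary_vertex_values(edge_len, mid_vals)
    v_right = np.roll(v, -1)          # vertex at the right end of edge k
    # Trapezoid rule on both halves of each edge (exact for piecewise linears).
    return float(np.sum(edge_len / 4.0 * (v + 2.0 * mid_vals + v_right)))
```

The two integrals agree for arbitrary edge lengths, which is the reason the rule does not require a uniform refinement.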



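Lemma 3 Given $u \in V^h|_{\Omega_i}$, let $\bar u \in \bar V^{h/2}|_{\Omega_i}$ be given by $\bar u = \mathcal{M}_i u$. Then 

$$|\bar u|_{H^1_a(\Omega_i)} \asymp |u|_{H^1_{a,h}(\Omega_i)}, \tag{19}$$

$$\|\bar u\|_{L^2_a(\Omega_i)} \asymp \|u\|_{L^2_a(\Omega_i)}, \tag{20}$$

and 

$$\int_{\partial\Omega_i} \bar u(s)\, ds = \int_{\partial\Omega_i} u(s)\, ds. \tag{21}$$

Here $|\cdot|_{H^1_a(\Omega_i)}$ is the standard weighted energy semi norm for conforming functions. 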

Proof. We first note that we have results similar to (6) and (7) for the conforming space $\bar V^{h/2}|_{\Omega_i}$, 
where now $M_1$, $M_2$ and $M_3$ are the vertices of a triangle in $\mathcal{T}^{h/2}$. In order to prove (19), we compare 
(7) with the analogous formula for the piecewise linear conforming space. 

For instance (see Figure 2), 

The right hand side can be controlled by the energy semi norm of u restricted to the union of the 
triangles $T_7$, $T_8$ and $T_9$. 

We can also prove that if we take two neighboring vertices of $\mathcal{T}^h$ in the interior of $\Omega_i$, the 
energy semi norm can be bounded locally. If $a(x)$ does not vary a great deal we can work with 
weighted semi norms. Using the fact that our arguments are local, it is easy to obtain the upper 
bound of (19). 

The lower bound is easy to obtain since the degrees of freedom of $V^h$ are contained in those of 
$\bar V^{h/2}$. 

Similar arguments can also be used to obtain (20). 

Finally, it is easy to see that (21) follows directly from iii) even if the refinement is not 
uniform. □ 
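We define another local equivalence map $\mathcal{M}_i^E : V^h|_{\Omega_i} \to \bar V^{h/2}|_{\Omega_i}$, by: 

Isomorphism 2 Given $u \in V^h|_{\Omega_i}$ and an edge E of $\partial\Omega_i$, define $\bar u = \mathcal{M}_i^E u$ by the values of $\bar u$ at the 
following sets of points (cf. Figure 2): 

i) Same as step i) of Isomorphism 1. 

ii) Same as step ii) of Isomorphism 1. 

iii) If V is a vertex of $\mathcal{T}^h|_{\partial\Omega_i}$ and an end point of E, and $V_r$ is the midpoint of $\mathcal{T}^h|_E$ that is 
the next neighbor of V, then 

$\bar u(V) := u(V_r)$. 

iv) If Q is a vertex of $\mathcal{T}^h|_{\partial\Omega_i}$ and we are not in case iii), then $\bar u(Q)$ is defined as in step iii) of 
Isomorphism 1. 

Using the same ideas as in Lemma 3, we can prove: 

Lemma 4 Given $u \in V^h|_{\Omega_i}$, let $\bar u \in \bar V^{h/2}|_{\Omega_i}$ be given by $\bar u = \mathcal{M}_i^E u$. Then 

$$|\bar u|_{H^1_a(\Omega_i)} \asymp |u|_{H^1_{a,h}(\Omega_i)}, \tag{22}$$

$$\|\bar u\|_{L^2_a(\Omega_i)} \asymp \|u\|_{L^2_a(\Omega_i)}, \tag{23}$$

and 

$$\int_E \bar u(s)\, ds = \int_E u(s)\, ds. \tag{24}$$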

THE INTERPOLATION OPERATOR 


Let $v \in V^h$ and let $P_{ij}$ be the midpoint of the edge $E_{ij}$ common to $\Omega_i$ and $\Omega_j$. 

Definition 2 The Interpolation operator $I_h^H : V^h \to V^H$, is given by: 

$$(I_h^H v)(P_{ij}) := \frac{1}{|E_{ij}|} \int_{E_{ij}} v|_{\Omega_i}(x)\, dx = \frac{1}{|E_{ij}|} \int_{E_{ij}} v|_{\Omega_j}(x)\, dx. \tag{25}$$

The second equality follows from the fact that the mean of v on each edge of an element of $\mathcal{T}^h$ is 
equal to $v(M_x)$, where $M_x$ is the midpoint of the edge. It is important to note that the value of 
$(I_h^H v)(P_{ij})$ depends only on the values of v on the interface $E_{ij}$. This allows us to obtain stability 
properties that are independent of the differences of $a(x)$ across the substructure interfaces. 
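
Since the mean of a nonconforming function over a fine edge equals its midpoint value, the coarse value on an interface reduces to a length-weighted average of fine midpoint values. The following is a minimal sketch of that computation; the function name `coarse_edge_value` and its arguments are hypothetical.

```python
import numpy as np

def coarse_edge_value(fine_lengths, fine_mid_values):
    """Mean over a coarse edge, computed from the lengths of the fine edges that
    make it up and the values at their midpoints."""
    fine_lengths = np.asarray(fine_lengths, dtype=float)
    fine_mid_values = np.asarray(fine_mid_values, dtype=float)
    return float(fine_lengths @ fine_mid_values / fine_lengths.sum())
```

Only values on the interface enter the computation, so the result is the same whichever of the two neighboring substructures is used to evaluate it.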

Before studying the stability properties of this operator, we need two lemmas for the piecewise 
linear conforming finite element space. 

The following lemma is a Poincaré-Friedrichs inequality. The idea of the proof can be found in 
Ciarlet (Theorem 6.1) [9] and in Nečas (Chapter 2.7.2) [15]. 
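Lemma 5 Let $\Gamma$ be a subset of $\partial\Omega_i$, such that $\Gamma$ and $\partial\Omega_i$ have measures of order H. Then, 

$$\|u\|^2_{L^2(\Omega_i)} \preceq H^2 |u|^2_{H^1(\Omega_i)} + \left( \int_\Gamma u(x)\, dx \right)^2, \quad \forall\, u \in H^1(\Omega_i). \tag{26}$$

As a consequence, if $\int_\Gamma u(x)\, dx = 0$, we have the Poincaré inequality 

$$\|u\|_{L^2(\Omega_i)} \preceq H\, |u|_{H^1(\Omega_i)}. \tag{27}$$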

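The next lemma is a Poincaré-Friedrichs inequality for nonconforming $P_1$ elements. It is obtained 
by using Lemmas 3, 4 and 5. 

Lemma 6 Let $u \in V^h|_{\Omega_i}$, where $\Omega_i$ is a triangular substructure of diameter $O(H)$. Let $\Gamma$ be $\partial\Omega_i$ 
(or an edge of $\partial\Omega_i$). Then, 

$$\|u\|^2_{L^2(\Omega_i)} \preceq H^2 |u|^2_{H^1_h(\Omega_i)} + \left( \int_\Gamma u(x)\, dx \right)^2, \quad \forall\, u \in V^h|_{\Omega_i}. \tag{28}$$

As a consequence, if $\int_\Gamma u(x)\, dx = 0$, we have the Poincaré inequality 

$$\|u\|_{L^2(\Omega_i)} \preceq H\, |u|_{H^1_h(\Omega_i)}. \tag{29}$$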

The next lemma gives an example of an operator that is $L^2_a$- and $H^1_a$-stable. 

Lemma 7 Let $u \in H^1(\Omega_i)$, where $\Omega_i$ is a triangular substructure of diameter $O(H)$. Define a 
linear function $u_H$ in $\Omega_i$ by 

$$u_H(P_{ij}) := \frac{1}{|E_{ij}|} \int_{E_{ij}} u(x)\, dx, \quad j = 1, 2, 3, \tag{30}$$

where the $E_{ij}$ are the edges of $\Omega_i$, and $P_{ij}$ is the midpoint of $E_{ij}$. Then, 

$$|u_H(P_{ij})|^2 \preceq \frac{1}{H^2} \|u\|^2_{L^2(\Omega_i)} + |u|^2_{H^1(\Omega_i)}, \tag{31}$$

$$|u_H|_{H^1(\Omega_i)} \preceq |u|_{H^1(\Omega_i)}, \tag{32}$$

and 

$$\|u_H - u\|_{L^2(\Omega_i)} \preceq H\, |u|_{H^1(\Omega_i)}. \tag{33}$$

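Proof. Consider initially a region $\Omega_i$ with a diameter of 1. Using that $|E_{ij}| = O(1)$, the 
Cauchy-Schwarz inequality and a trace theorem, we have 

$$|u_H(P_{ij})|^2 \asymp \Big| \int_{E_{ij}} u(x)\, dx \Big|^2 \preceq \|u\|^2_{L^2(E_{ij})} \preceq \|u\|^2_{L^2(\partial\Omega_i)} \preceq \|u\|^2_{H^1(\Omega_i)} = \|u\|^2_{L^2(\Omega_i)} + |u|^2_{H^1(\Omega_i)}.$$

We obtain (31) by returning to a region of diameter H. 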
Note that for any constant c 

$$|u_H(P_{i1}) - u_H(P_{i2})|^2 + |u_H(P_{i2}) - u_H(P_{i3})|^2 + |u_H(P_{i3}) - u_H(P_{i1})|^2 \preceq \|u - c\|^2_{H^1(\Omega_i)}. \tag{34}$$

By choosing $c = u_H(P_{i1})$ and $\Gamma = E_{i1}$, we can apply Lemma 5 and obtain the $H^1$-stability (32). 

We now prove the $L^2$-stability. Since $u - u_H$ has mean zero on $\partial\Omega_i$, we can apply the Poincaré 
inequality (27) and obtain 

$$\|u - u_H\|_{L^2(\Omega_i)} \preceq H\, |u - u_H|_{H^1(\Omega_i)}.$$

Using the first part of this lemma, we obtain the $L^2$-stability (33). □ 

The next lemma shows that the interpolation operator $I_h^H$, defined by (25), is locally $L^2$- and 
$H^1$-stable. 

Lemma 8 Let $u \in V^h(\Omega)$. Then $u_H = I_h^H u$ satisfies the following properties 

$$|u_H|_{H^1_{a,H}(\Omega_i)} \preceq |u|_{H^1_{a,h}(\Omega_i)}, \tag{36}$$

and 

$$\|u_H - u\|_{L^2_a(\Omega_i)} \preceq H\, |u|_{H^1_{a,h}(\Omega_i)}, \quad i = 1, \dots, N. \tag{37}$$

Proof. Let $u_H = I_h^H u$, let $\bar u \in H^1(\Omega_i)$ be given by $\bar u = \mathcal{M}_i^E u$, and let $\bar u_H(P_{ij})$ be given by 
(30). Using the properties (24) and (25), we have 

$$u_H(P_{i1}) = \bar u_H(P_{i1}). \tag{38}$$

Therefore, by (38), (31) and Lemma 4, we have 

$$|u_H(P_{i1})|^2 = |\bar u_H(P_{i1})|^2 \preceq \frac{1}{H^2} \|\bar u\|^2_{L^2(\Omega_i)} + |\bar u|^2_{H^1(\Omega_i)} \preceq \frac{1}{H^2} \|u\|^2_{L^2(\Omega_i)} + |u|^2_{H^1_h(\Omega_i)}. \tag{39}$$

We also obtain the same estimate for $|u_H(P_{i2})|$ and $|u_H(P_{i3})|$. 

The rest of the proof is similar to that of Lemma 7. We now use the Poincaré inequality for 
nonconforming elements. □ 


THE PROLONGATION OPERATOR 


In this section, we introduce several prolongation operators and establish that they are stable. 
The range of each of these operators will serve as a coarse space in our algorithms. 
Definition 3 The Prolongation Operator $I_H^h : V^H \to V^h$, is given by: 

i) For all nodal points P of $\mathcal{T}^h$ that belong to an edge $E_{ij}$ common to $\Omega_i$ and $\Omega_j$, let 
$(I_H^h u_H)(P) := u_H(P_{ij})$, where $P_{ij}$ is the midpoint of the edge $E_{ij}$. 

ii) Given $I_H^h u_H$ at the nodal points of $\Gamma = \bigcup_i \partial\Omega_i$ from i), let $I_H^h u_H$ be the 
$P_1$-nonconforming harmonic extension inside each $\Omega_i$. 

It is easy to check that $u_h = I_H^h u_H \in V^h$. A disadvantage of step ii) is that we have to solve 
exactly a local Dirichlet problem for each substructure in order to obtain the harmonic extension. 
Other extensions can be used, which we call approximate harmonic extensions. They are given by 
simple explicit formulas and have the same $L^2_a$ and $H^1_{a,h}$ stability properties as the harmonic one. 



Our first construction is a natural generalization of the partition of unity introduced by Dryja 
and Widlund in [11]; this partition of unity will provide the basis functions of our approximate 
extensions. Let $P_j$, $j = 1, 2, 3$, be the midpoints of the edges of $\Omega_i$, and let $V_j$ be the vertex of $\Omega_i$ 
that is opposite to $P_j$. Let C be the barycenter of the triangle $\Omega_i$, i.e. the intersection of the line 
segments connecting $V_j$ to $P_j$. 

Extension 1 The construction of an approximate harmonic extension is defined by the following 
steps (see Figure 3): 

i) Let 

$$u(C) := \frac{1}{3} \{u_H(P_1) + u_H(P_2) + u_H(P_3)\}.$$

ii) For a point R that belongs to a line segment that connects C to a vertex $V_j$, let 

$u(R) := u(C)$. 

iii) For a point Q that belongs to a line segment connecting C to $P_j$, define $u(Q)$ by 
linear interpolation between the values $u(C)$ and $u_H(P_j)$, i.e. by 

$$u(Q) := \lambda(Q)\, u(C) + (1 - \lambda(Q))\, u_H(P_j).$$

Here $\lambda(Q) = \mathrm{distance}(Q, P_j) / \mathrm{distance}(C, P_j)$. 

iv) For a point S that belongs to the line segment connecting the previous point Q to a 
vertex $V_k$, with $k \ne j$, let 

$u(S) := u(Q)$. 

v) Finally, let $I_H^h u_H = I_h u$, where $I_h$ is the interpolation operator into the space $V^h$ that 
preserves the values of a function at the midpoints of the edges of the elements. 
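
Steps i) to iv) of Extension 1 can be sketched as a pointwise evaluation routine. This is an illustration under stated assumptions, not the author's implementation: the triangle is split into the six sub-triangles spanned by a vertex, the barycenter C and an edge midpoint, and the value at a substructure vertex (where the function is discontinuous) is arbitrarily set to u(C). The names `barycentric` and `extension1` are hypothetical, and the final interpolation into the fine space (step v) is left out.

```python
import numpy as np

def barycentric(p, a, b, c):
    """Barycentric coordinates of the point p with respect to the triangle (a, b, c)."""
    T = np.column_stack([b - a, c - a])
    s, t = np.linalg.solve(T, p - a)
    return np.array([1.0 - s - t, s, t])

def extension1(p, verts, edge_vals, tol=1e-12):
    """Value at p of the function built in steps i)-iv).

    verts[j] is the vertex V_j and edge_vals[j] is the coarse value at P_j,
    the midpoint of the edge opposite to V_j.
    """
    verts = np.asarray(verts, dtype=float)
    p = np.asarray(p, dtype=float)
    C = verts.mean(axis=0)                    # barycenter of the substructure
    u_C = float(np.mean(edge_vals))           # step i)
    for j in range(3):
        k, l = (j + 1) % 3, (j + 2) % 3
        P_j = 0.5 * (verts[k] + verts[l])
        for apex in (k, l):                   # sub-triangle (V_apex, C, P_j)
            w = barycentric(p, verts[apex], C, P_j)
            if np.all(w >= -tol):
                if w[1] + w[2] <= tol:        # p is the vertex itself
                    return u_C
                # Q is where the line from V_apex through p meets the segment C-P_j;
                # lam = distance(Q, P_j) / distance(C, P_j).
                lam = w[1] / (w[1] + w[2])
                return lam * u_C + (1.0 - lam) * edge_vals[j]   # steps iii), iv)
    raise ValueError("point lies outside the triangle")
```

On each edge of the substructure the routine returns the constant coarse value of that edge, and at the barycenter it returns the mean of the three coarse values.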

Note that the function u just constructed is continuous except at the vertices $V_j$ of $\Omega_i$. Step 
i) can be viewed as emulating the mean value theorem for harmonic functions. However, near the 
vertices, u is a bad approximation of the harmonic extension. We know that the local behavior of 
the harmonic extension near a vertex $V_j$ depends primarily on the boundary values in the vicinity 
of $V_j$. For instance, if $u_H(P_1) = 0$, $u_H(P_3) = 0$, and $u_H(P_2) = 1$, we should obtain $u_h \approx 0$ near $V_2$; in 
addition, by using symmetry arguments, we should have $u_h \approx 1/2$ for points near $V_1$ that lie on the 
bisector that passes through $V_1$. With this in mind, we now construct an alternative approximate 
harmonic extension. 

We change notation in order to be able to use Figure 3. Now let C be the point where the three 
bisectors intersect. 

Extension 2 The construction of the approximate harmonic extension is defined by (see Figure 3): 

i) Same as Step i) of Extension 1. 

ii) Define $u(V_j) = \frac{1}{2} \sum_{k \ne j} u_H(P_k)$. For a point R that belongs to a line segment connecting 
C to $V_j$, define $u(R)$ by linear interpolation between the values $u(C)$ and $u(V_j)$. 

iii) Same as Step iii) of Extension 1. 

iv) For a point S that belongs to a line segment connecting the previous point Q to 
$V_k$, $k \ne j$, $u(S)$ is defined by linear interpolation between the values $u(Q)$ at Q and 
$f(Q, j, k)$ at $V_k$. Here, 

$$f(Q, j, k) = \lambda(Q)\, u(V_k) + (1 - \lambda(Q))\, u_H(P_j).$$

v) Same as Step v) of Extension 1. 


A disadvantage of this extension is that we cannot just work in a reference triangle, since the 
angles are not preserved under a linear transformation. This is similar to the fact that under a 
linear transformation a harmonic function does not necessarily remain harmonic. We can construct 
other approximate harmonic extensions which combine the properties of the two extensions given 
so far, for instance by working with the barycenter C in Extension 2 and replacing the weight 
1/2 in Step ii). 

The next lemma shows that the extensions given above have quasi-optimal energy stability. 
Using ideas of Dryja and Widlund [11], we prove the following lemma. 





Lemma 9 Let $u_H \in V^H(\Omega)$. Then 

$$|I_H^h u_H|_{H^1_{a,h}(\Omega_i)} \preceq (1 + \log(H/h))^{1/2}\, |u_H|_{H^1_{a,H}(\Omega_i)} \tag{40}$$

and 

$$\|I_H^h u_H - u_H\|_{L^2_a(\Omega_i)} \preceq H\, |u_H|_{H^1_{a,H}(\Omega_i)}. \tag{41}$$

Proof. Let $\theta_j^h \in V^h|_{\Omega_i}$, $j = 1, 2, 3$, be the approximate harmonic extensions constructed from the 
boundary values $\theta_j^h = 1$ at the h-mesh nodes on the edge $E_{ij}$, and $\theta_j^h = 0$ at the other boundary 
nodes of $\partial\Omega_i$. It is easy to see that the $\theta_j^h$ form a basis of all approximate harmonic extensions that 
take constant values on the edges of the substructure. It is easy to show that if a point x belongs to 
the interior of an element of $\Omega_i$, then $|\nabla \theta_j^h(x)|$ is bounded by $C/r$, where r is the minimum distance 
from x to any vertex of $\Omega_i$. Note that any element that touches a vertex of $\Omega_i$ provides an order one 
contribution to the energy semi norm. To estimate the contribution to the energy semi norm from 
the rest of the substructure, we introduce polar coordinate systems centered at the vertices of $\Omega_i$. 
Then, 

$$|\theta_j^h|^2_{H^1_h(\Omega_i)} \preceq 1 + \int_\varphi \int_h^H r^{-2}\, r\, dr\, d\varphi \preceq 1 + \log(H/h). \tag{42}$$

Since the partition of unity $\theta_j^h$ forms a basis, it is easy to see that 

$$|I_H^h u_H|^2_{H^1_h(\Omega_i)} \preceq (1 + \log(H/h)) \{ |u_H(P_1)|^2 + |u_H(P_2)|^2 + |u_H(P_3)|^2 \} \tag{43}$$

and using ideas similar to that of Lemma 7, we have 

$$|I_H^h u_H|^2_{H^1_h(\Omega_i)} \preceq (1 + \log(H/h)) \{ |u_H(P_1) - u_H(P_2)|^2 + |u_H(P_2) - u_H(P_3)|^2 + |u_H(P_3) - u_H(P_1)|^2 \} \asymp (1 + \log(H/h))\, |u_H|^2_{H^1_H(\Omega_i)}.$$

By construction, it is easy to see that 

$$|(I_H^h u_H)(x)| \le \max_{i = 1, 2, 3} |u_H(P_i)|.$$

Therefore 

$$\|I_H^h u_H - u_H\|^2_{L^2(\Omega_i)} \preceq H^2 \sum_i |u_H(P_i)|^2,$$

and by using (39) and (29), we obtain (41). 

Since $a(x)$ varies little in each $\Omega_i$, these arguments are also valid for the weighted norms and we 
obtain (40). □ 
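
The logarithmic factor in the proof comes from the radial integral of r^{-2} r between h and H. The following minimal sketch evaluates that integral with a midpoint rule on a geometric grid and can be compared with log(H/h); the function name `radial_energy` and the number of cells are assumptions made for the illustration.

```python
import math

def radial_energy(h, H, n=2000):
    """Midpoint-rule value of the integral of r^{-2} * r over [h, H],
    computed on a geometric grid with n cells."""
    ratio = (H / h) ** (1.0 / n)
    total, r_lo = 0.0, h
    for _ in range(n):
        r_hi = r_lo * ratio
        r_mid = 0.5 * (r_lo + r_hi)
        total += (r_mid ** -2) * r_mid * (r_hi - r_lo)
        r_lo = r_hi
    return total
```

Halving h at fixed H increases the value by log 2 only, which is why the energy bound grows like 1 + log(H/h) rather than like a power of H/h.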
Using Lemmas 6 and 9 and the triangle inequality, we have: 

Theorem 1 Let $u \in V^h(\Omega)$. Then 

$$\|I_H^h I_h^H u - u\|_{L^2_a(\Omega_i)} \preceq H\, |u|_{H^1_{a,h}(\Omega_i)} \tag{44}$$

and 

$$|I_H^h I_h^H u|_{H^1_{a,h}(\Omega_i)} \preceq (1 + \log(H/h))^{1/2}\, |u|_{H^1_{a,h}(\Omega_i)}. \tag{45}$$

Remark 1 It is easy to see that we do not need to use the fact that $u_H \in V^H(\Omega)$; we only need to 
calculate values $u_H(P_{ij})$ by formula (25) at the midpoint $P_{ij}$ of the edge $E_{ij}$. The next step is to 
provide the constant value $u_H(P_{ij})$ to all nodes of the interface and perform an approximate 
harmonic extension. 


Remark 2 The extensions also can be constructed for nontriangular substructures. In a first step, 
we construct a partition of unity in $\Omega_i$. This can be done by using ideas similar to those of the 
triangular case. By using the same technique as in the proof of Lemma 9, we can show that 

$$|I_H^h u_H|^2_{H^1_{a,h}(\Omega_i)} \preceq (1 + \log(H/h)) \sum_{j=1}^{N_i} a(\Omega_i)\, |u_H(P_{ij}) - u_H(P_{i(j-1)})|^2, \tag{46}$$

where $P_{ij}$ and $P_{i(j-1)}$ are neighboring midpoints of edges of $\partial\Omega_i$ and $N_i$ is the number of edges of 
$\partial\Omega_i$. We obtain (44) by noting that each term of the sum is bounded by $|u|^2_{H^1_{a,h}(\Omega_i)}$. 


THE NEUMANN-NEUMANN BASIS 


In this section, we consider a Neumann-Neumann coarse space. This is the $P_1$ nonconforming 
version of a coarse space studied in Dryja and Widlund [16], and Mandel and Brezina [17]. 
However, here we use an approximate harmonic extension inside the substructures. We note that 
the coarse spaces considered by these authors differ only in how certain weights are chosen. Mandel 
and Brezina use weights that are convex combinations of the coefficient $a(x)$, while Dryja and 
Widlund use $a^{1/2}(x)$. Here we show that any convex combination of $a^\beta(x)$, for $\beta \ge 1/2$, leads to 
stability. We point out that the choice $\beta = 1/2$ can be viewed as an $L^2$-average, while $\beta = 1$ is an 
average in the $L^1$ sense. 

We call the coarse space of the previous section face based. There are some differences between 
Neumann-Neumann and face based coarse spaces. A Neumann-Neumann coarse space has one 
degree of freedom per substructure, while a face based coarse space uses one degree of freedom per edge. A 
Neumann-Neumann basis function associated with the substructure $\Omega_i$ has support in $\Omega_i$ and its 
neighboring substructures, while a face based basis function, associated with an edge of a 
substructure, has support in just two substructures. The face based coarse space appears to be 
more stable since all the estimates related to the jumps of the coefficients are tight. In the lemmas 
that we have proved for the face based methods, all the stability results were derived in individual 
substructures, while in the Neumann-Neumann case, we need to work in an extended subdomain. 



Figure 4. 


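Definition 4 The Neumann-Neumann interpolation operator $I_{NN} : V^h \to V^h$ is defined as follows: 

i) For each substructure $\Omega_i$ calculate the mean value on $\partial\Omega_i$, i.e. 

$$m_i u := \frac{1}{|\partial\Omega_i|} \int_{\partial\Omega_i} u(s)\, ds.$$

Here $|\partial\Omega_i|$ is the length of $\partial\Omega_i$. 

ii) For all nodal points P of $\mathcal{T}^h$ that belong to the edge $E_{ij}$, let 
$(I_{NN} u)(P) = (I_h^\beta u)(P_{ij})$, where 

$$(I_h^\beta u)(P_{ij}) := \frac{a^\beta(\Omega_i)}{a^\beta(\Omega_i) + a^\beta(\Omega_j)}\, m_i u + \frac{a^\beta(\Omega_j)}{a^\beta(\Omega_i) + a^\beta(\Omega_j)}\, m_j u.$$

Here $P_{ij}$ is the midpoint of the edge $E_{ij}$. 

iii) Perform an approximate harmonic extension to define $I_{NN} u$ inside the substructures. 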
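
The weighted interface value of step ii) can be sketched in a few lines. The code assumes that the two boundary means are combined with weights proportional to the coefficient of each substructure raised to the power beta, so that the weights sum to one; the function name `nn_interface_value` and its arguments are hypothetical.

```python
def nn_interface_value(a_i, a_j, mean_i, mean_j, beta=0.5):
    """Convex combination of the boundary means of two neighboring
    substructures, weighted by their coefficients to the power beta."""
    w_i = a_i ** beta / (a_i ** beta + a_j ** beta)
    return w_i * mean_i + (1.0 - w_i) * mean_j
```

With equal coefficients the result is the plain average of the two means; when one coefficient is much larger than the other, the value is dominated by the mean taken on the substructure with the larger coefficient.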


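Note that we can also calculate $m_i u$ by: 

$$m_i u = \frac{1}{|\partial\Omega_i|} \sum_j |E_{ij}|\, (I_h^H u)(P_{ij}). \tag{47}$$

Therefore, there exists a linear transformation $I_H^\beta : V^H \to V^H$, such that $I_h^\beta u = I_H^\beta I_h^H u$. The next 
lemma establishes stability properties for $I_H^\beta$. 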
Lemma 10 Let $u_H \in V^H(\Omega)$ and $\beta \ge 1/2$. Then 

$$|I_H^\beta u_H|_{H^1_{a,H}(\Omega_i)} \le C(\beta)\, |u_H|_{H^1_{a,H}(\Omega_i^{ext})}, \tag{48}$$

and 

$$\|I_H^\beta u_H - u_H\|_{L^2_a(\Omega_i)} \le C(\beta)\, H\, |u_H|_{H^1_{a,H}(\Omega_i^{ext})}. \tag{49}$$

Here the extended domain $\Omega_i^{ext}$ is the union of $\Omega_i$ and the substructures that share an edge with $\Omega_i$. 


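Proof. Let us first prove the $L^2_a$ stability. Note that (see Figure 4) 

$$|u_H(P_{ij}) - (I_H^\beta u_H)(P_{ij})|^2 = \left| u_H(P_{ij}) - \frac{a^\beta(\Omega_i)\, m_i u_H + a^\beta(\Omega_j)\, m_j u_H}{a^\beta(\Omega_i) + a^\beta(\Omega_j)} \right|^2.$$

By using (47) and simple calculations, this quantity is equal to 

$$\frac{1}{|a^\beta(\Omega_i) + a^\beta(\Omega_j)|^2} \left| a^\beta(\Omega_i) \sum_k \frac{|E_{ik}|}{|\partial\Omega_i|} (u_H(P_{ij}) - u_H(P_{ik})) + a^\beta(\Omega_j) \sum_l \frac{|E_{jl}|}{|\partial\Omega_j|} (u_H(P_{ij}) - u_H(P_{jl})) \right|^2,$$

where the sums run over the edges of $\Omega_i$ and $\Omega_j$, respectively. 

Using the shape regularity of the subdomains, it is easy to see that 

$$a(\Omega_i)\, |u_H(P_{ij}) - (I_H^\beta u_H)(P_{ij})|^2 \preceq \frac{a(\Omega_i)\, a^{2\beta}(\Omega_i)}{|a^\beta(\Omega_i) + a^\beta(\Omega_j)|^2}\, |u_H|^2_{H^1_H(\Omega_i)} + \frac{a(\Omega_i)\, a^{2\beta}(\Omega_j)}{|a^\beta(\Omega_i) + a^\beta(\Omega_j)|^2}\, |u_H|^2_{H^1_H(\Omega_j)}, \tag{50}$$

and using the fact that $\beta \ge 1/2$, we can bound this quantity by 

$$\le C(\beta)\, |u_H|^2_{H^1_{a,H}(\Omega_i \cup \Omega_j)}.$$

We obtain (49) by adding all the contributions (50) to the $L^2_a(\Omega_i)$ norm. 

We prove (48) by using the triangle inequality, an inverse inequality, and (49). □ 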

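Theorem 2 Let $u \in V^h(\Omega)$ and $\beta \ge 1/2$. Then 

$$\|I_{NN} u - u\|_{L^2_a(\Omega_i)} \le C(\beta)\, H\, |u|_{H^1_{a,h}(\Omega_i^{ext})}, \tag{51}$$

and 

$$|I_{NN} u|_{H^1_{a,h}(\Omega_i)} \le C(\beta)\, (1 + \log(H/h))^{1/2}\, |u|_{H^1_{a,h}(\Omega_i^{ext})}. \tag{52}$$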
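Proof. Using Lemmas 9, 10 and 8, we have 

$$|I_{NN} u|_{H^1_{a,h}(\Omega_i)} \preceq (1 + \log(H/h))^{1/2}\, |I_H^\beta I_h^H u|_{H^1_{a,H}(\Omega_i)} \le C(\beta)\, (1 + \log(H/h))^{1/2}\, |I_h^H u|_{H^1_{a,H}(\Omega_i^{ext})} \le C(\beta)\, (1 + \log(H/h))^{1/2}\, |u|_{H^1_{a,h}(\Omega_i^{ext})}.$$

The $L^2_a$-stability is obtained by 

$$\|I_{NN} u - u\|_{L^2_a(\Omega_i)} \le \|I_{NN} u - I_H^\beta I_h^H u\|_{L^2_a(\Omega_i)} + \|I_H^\beta I_h^H u - I_h^H u\|_{L^2_a(\Omega_i)} + \|I_h^H u - u\|_{L^2_a(\Omega_i)},$$

and by using Lemmas 9, 10 and 8. □ 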

Remark 3 We can also prove Theorem 2 for the case of nontriangular substructures; cf. Remarks 
1 and 2. 


THE THREE DIMENSIONAL CASE 

We show in this section that the methods developed before can be extended to three dimensions. 

For simplicity, we assume that $\Omega$ is a polyhedral region of diameter 1 in three dimensional space. 
As before, we introduce a nonoverlapping partition composed of tetrahedra of diameter of order 
H. This defines a coarse space and a triangulation $\mathcal{T}^H$. We further subdivide the substructures into 
tetrahedra, which results in a triangulation $\mathcal{T}^h$, and define the nonconforming $P_1$ finite element 
spaces $V^h$ and $V^H$ as in Definition 1. Here, the continuity is enforced at the barycenters of the faces 
of the triangulations. 

The local equivalence maps are given by the following procedure. In each tetrahedral element of 
$\mathcal{T}^h$ (cf. Figure 5), we connect its centroid to the four vertices and to the barycenters of the four 
faces. We also connect each barycenter to the three vertices. In other words, we subdivide each 
tetrahedral element into twelve subtetrahedra. We denote this new triangulation by $\hat{\mathcal{T}}^h$. The 
vertices of $\hat{\mathcal{T}}^h$ are the vertices, barycenters, and centroids of the elements of $\mathcal{T}^h$. 

Let $\bar V^h|_{\Omega_i}$ be the conforming space of piecewise linear functions of the triangulation $\hat{\mathcal{T}}^h|_{\Omega_i}$. 

We define the local equivalence map $\mathcal{M}_i : V^h|_{\Omega_i} \to \bar V^h|_{\Omega_i}$, as follows: 

Isomorphism 3 Given $u \in V^h|_{\Omega_i}$, define $\bar u = \mathcal{M}_i u$ by the values of $\bar u$ at the following sets of 
points: 

i) If P is a vertex of an element of $\mathcal{T}^h$ and belongs to the interior of $\Omega_i$, and the $K_j$ are 
the elements in $\mathcal{T}^h$ that have P as a vertex, then 

$\bar u(P) :=$ mean of $u|_{K_j}(P)$. 

Here $u|_{K_j}(P)$ is the limit value of $u(x)$ when $x \in K_j$ approaches P. 

ii) If P is a barycenter of a triangle in $\mathcal{T}^h$, then 

$\bar u(P) := u(P)$. 

iii) If P is a vertex of a triangle in $\mathcal{T}^h|_{\partial\Omega_i}$, and $T_j$, $j = 1, \dots, N_P$, are the triangles of 
$\mathcal{T}^h|_{\partial\Omega_i}$ that have P as a vertex, then 

$$\bar u(P) := \sum_{k=1}^{N_P} \frac{|T_k|}{\left| \bigcup_{j=1}^{N_P} T_j \right|}\, u(C_k).$$

Here $C_k$ and $|T_k|$ are the barycenter and the area of the triangle $T_k$, respectively. 


It is easy to check that Lemma 3 holds, if we replace $\bar V^{h/2}|_{\Omega_i}$ by $\bar V^h|_{\Omega_i}$. 

We define another local equivalence map $\mathcal{M}_i^F : V^h|_{\Omega_i} \to \bar V^h|_{\Omega_i}$ by: 

Isomorphism 4 Given $u \in V^h|_{\Omega_i}$ and a face F of $\partial\Omega_i$, define $\bar u = \mathcal{M}_i^F u$ by the values of $\bar u$ at the 
following sets of points: 

i) Same as step i) of Isomorphism 3. 

ii) Same as step ii) of Isomorphism 3. 

iii) Let P be a vertex of a triangle in $\mathcal{T}^h|_{\partial\Omega_i}$ that belongs to $\partial F$, and let $T_j$, 
$j = 1, \dots, N_P$, be the triangles of $\mathcal{T}^h|_F$ that have P as a vertex. Then 

$$\bar u(P) := \sum_{k=1}^{N_P} \frac{|T_k|}{\left| \bigcup_{j=1}^{N_P} T_j \right|}\, u(C_k).$$

iv) Let P be a vertex of a triangle in $\mathcal{T}^h|_{\partial\Omega_i}$ that does not belong to $\partial F$, and let $T_j$, 
$j = 1, \dots, N_P$, be the triangles of $\mathcal{T}^h|_{\partial\Omega_i}$ that have P as a vertex. Then 

$$\bar u(P) := \sum_{k=1}^{N_P} \frac{|T_k|}{\left| \bigcup_{j=1}^{N_P} T_j \right|}\, u(C_k).$$

It is easy to check that Lemma 4 holds, if we replace $\bar V^{h/2}|_{\Omega_i}$ by $\bar V^h|_{\Omega_i}$, and let the faces play the 
role previously played by the edges. 

Let $v \in V^h$ and let $C_{ij}$ be the barycenter of the face $F_{ij}$ common to $\Omega_i$ and $\Omega_j$. 





Definition 5 The interpolation operator $I_h^H : V^h \to V^H$, is given by: 

$$(I_h^H v)(C_{ij}) := \frac{1}{|F_{ij}|} \int_{F_{ij}} v|_{\Omega_i}(x)\, dx = \frac{1}{|F_{ij}|} \int_{F_{ij}} v|_{\Omega_j}(x)\, dx,$$

where $|F_{ij}|$ is the area of the face $F_{ij}$. 

Using the same ideas as in two dimensions, we can prove lemmas analogous to Lemmas 5-8. 

The prolongation operator $I_H^h : V^H \to V^h$ is defined as in the two dimensional case. In a first 
step, we define $(I_H^h u_H)(P) := u_H(C_{ij})$ for all barycenters P of triangles in $\mathcal{T}^h|_{F_{ij}}$. Finally, we 
perform a $P_1$-nonconforming harmonic or approximate harmonic extension. 

We describe the three dimensional version of Extension 1. This is a generalization of the 
partition of unity introduced by Dryja, Smith, and Widlund [4]. Let $C_j$, $j = 1, \dots, 4$, be the 
barycenters of the faces $F_j$ of $\partial\Omega_i$, and let $V_j$ be the vertex of $\Omega_i$ that is opposite to $C_j$. Let C be the 
centroid of $\Omega_i$, i.e. the intersection of the line segments connecting the $V_j$ to the $C_j$. Let $E_{jk}$, 
$k = 1, 2, 3$, be the edges of $\partial F_j$. 

Extension 3 The construction of an approximate harmonic extension $I_H^h u_H$ is defined by the 
following steps (see Figure 5): 

i) Let 

$$u(C) := \frac{1}{4} \sum_{j=1}^{4} u_H(C_j).$$

ii) For a point Q that belongs to a line segment connecting C to $C_j$, define $u(Q)$ by linear 
interpolation between the values $u(C)$ and $u_H(C_j)$, i.e. by 

$$u(Q) := \lambda(Q)\, u(C) + (1 - \lambda(Q))\, u_H(C_j).$$

Here $\lambda(Q) = \mathrm{distance}(Q, C_j) / \mathrm{distance}(C, C_j)$. 

iii) For a point S that belongs to any of the three triangles defined by the previous Q and 
the edges $E_{jk}$, $k = 1, \dots, 3$, let 

$u(S) := u(Q)$. 

iv) Finally, let $I_H^h u_H = I_h u$, where $I_h$ is the interpolation operator into the space $V^h$ that 
preserves the values of a function at the barycenters of the faces of elements in $\mathcal{T}^h$. 


We can also construct an approximate harmonic extension similar to that of Extension 2. This 
gives a better approximate harmonic extension near the edges. 

The prolongation operator $I_H^h$ in three dimensions has the same stability properties as in the two 
dimensional case, i.e. Lemma 9 still holds. 
The idea of the proof is the following. Consider the case where $u_H$ in $\Omega_i$ is given by $u_H(P_{i1}) = 1$ 
and $u_H(P_{ik}) = 0$ for $k \ne 1$. This gives the partition of unity introduced by Dryja, Smith, and 
Widlund [4]. The energy semi norm of $u_H$ is of order H. 

Let $\theta_1^h = I_H^h u_H$. We note that $|\nabla \theta_1^h(x)|$ is bounded by $C/r$, where r is the distance to the 
nearest edge of $\Omega_i$. The contribution to the energy semi norm from the union of the elements with 
at least one vertex on the edge of the substructure can be bounded by $CH$, since the extension is 
given by a convex combination of the boundary values. To estimate the contribution to the energy 
from the rest of the substructure, we introduce cylindrical coordinates using the appropriate 
substructure edge as the z-axis. Integrating $|\nabla \theta_1^h(x)|^2$ over this region, we find that it is bounded 
by $C(1 + \log(H/h))H$. 

To prove Lemma 9 for a general $u_H$, we use the same ideas as for two dimensions. Similarly, we 
can extend the results to nontriangular substructures and to the Neumann-Neumann case. 



MAIN RESULT 

In this section, we consider the Schwarz method introduced in the previous sections and prove 
the following result. 

Theorem 3 The operator P of the additive Schwarz algorithm, defined by the spaces $V_0^h$ and $V_i^h$, 
satisfies: 

$$\kappa(P) \preceq (1 + \log(H/h))\, (1 + H/\delta).$$

Here $\kappa(P)$ is the condition number of P. Therefore, if we use a generous overlap, then 

$$\kappa(P) \preceq 1 + \log(H/h).$$

Proof. The proof of this theorem is essentially the same as in the case of a conforming space; see 
Dryja and Widlund [12]. 

As we have seen before, the upper bound is very easy to obtain. The lower bound is obtained by
using Lemma 2. We partition the finite element function $u \in V_h$ as follows. We first choose
$u_0$ by applying a face-based or Neumann-Neumann interpolation operator to $u$. Let
$w = u - u_0$. The other terms in the representation of $u$ are defined by $u_i = I_h(\theta_i w)$, $i = 1, \dots, N$.
Here $I_h$ is the linear interpolation operator into the space $V_h$ that preserves the values at the
midpoints of the edges of the elements, and $\{\theta_i\}$ is a partition of unity with $\theta_i \in C_0^\infty(\Omega_i')$ and

\[ \sum_i \theta_i(x) = 1. \]

For a relatively generous overlap of the subdomains, these functions can be chosen so that $\nabla\theta_i$ is
bounded by $C/H$. By using the linearity of $I_h$, we can show that we have a correct partition of $u$.
In order to estimate the seminorm of $u_i$, we work on one element $K$ at a time. We obtain

^ \Qi w \Hi h (K) + ^ \Ih((Qi ~ h (K) 


Here $\bar\theta_i$ is the average value of $\theta_i$ over $K$. It is easy to see, by using the inverse inequality (8), that

\h({0i - 8i)w) \]ii h (K) ^ h ~ 2 ll J ft((0» ~ Oi) w )\\hw 

We can now use the fact that on $K$, $\theta_i$ differs from its average by at most $Ch/H$. After
summing over all elements of $\Omega_i'$, we arrive at the inequality


Kltfj jn') ^ Htfj.hW) + H 2 IMIi,2 ( n'.)- 


We sum over all $i$ and use that each point in $\Omega$ is covered only a fixed number of times and
obtain a uniform bound on $C_0$. We conclude the proof by estimating the two terms of




by |u|^i The bounds follow by using the stability results of Theorem 1 or 2. 

For the case of small overlap, the proof is similar to that of the case of piecewise linear 
conforming space considered in Dryja and Widlund [12]. □ 
SUMMARY

An automatic version of the multigrid method for the solution of linear systems arising from the 
discretization of elliptic PDE’s is presented. This version is based on the structure of the algebraic 
system solely, and does not use the original partial differential operator. Numerical experiments show 
that for the Poisson equation the rate of convergence of our method is equal to that of classical 
multigrid methods. Moreover, the method is robust in the sense that its high rate of convergence is 
conserved for other classes of problems: non-symmetric, hyperbolic (even with closed characteristics) 
and problems on non-uniform grids. No double discretization or special treatment of sub-domains 
(e.g. boundaries) is needed. When supplemented with a vector extrapolation method, high rates 
of convergence are achieved also for anisotropic and discontinuous problems and also for indefinite 
Helmholtz equations. A new double discretization strategy is proposed for finite and spectral element 
schemes and is found better than known strategies. 


1 INTRODUCTION 


The multigrid method is a powerful tool for the solution of linear systems which arise from elliptic 
PDE’s [1] [2]. This is an iterative method, in which the equation is first relaxed on the original fine 
grid in order to smooth the error; then the residual equation is sent to a coarser grid to be solved 
there and to supply a correction term. Recursion is used to solve the coarser grid problem in the 
same way the original equation is handled. In order to apply this procedure, the differential operator 
has to be discretized on all grids, and restriction and prolongation operators have to be defined 
in order to pass from coarse to fine grids and vice versa. The multigrid method works well for the 
Poisson equation in the square, but difficulties arise with non-symmetric problems, indefinite problems 
and problems with discontinuous coefficients or non-uniform grids. In those cases, it is not easy to 
discretize the differential operator on coarse grids and to generate the restriction and prolongation 
operators appropriately. Some suggestions about how to handle discontinuous coefficients are given 
in [4] and [5], while the singularly perturbed case is discussed in [6]. Slightly indefinite problems are 
discussed in [7]. These approaches involve special treatment of problems according to the original
PDE, and the need for a uniform approach is not yet fulfilled. 
In principle, the multigrid procedure is problem-dependent, and cannot serve as a “black box”
that solves every problem. Special attention has to be given to the neighborhood of the boundary 
and to lines of discontinuity. In [8] [9] [10] an algebraic multigrid method for symmetric problems 
is developed. Though this method is automatic in the sense that it depends on the linear system 
of equations solely, it suffers from the disadvantage of the coefficient matrices for coarse grids being 
of 9-diagonal type, even when the original matrix is of 5-diagonal type [11]. An algebraic version 
of multigrid which overcomes this difficulty is presented in [12], and generalized to nonsymmetric 
problems in [13]. This version, however, does not improve the classical multigrid in cases of indefinite 
or hyperbolic problems and of non-uniform grids. 

The algorithm which is presented in this work, and which we denote Multi Block Factorization
(MBF) (the reason for this terminology will become clear in the next section), gives a uniform
approach that enables one to handle the above difficulties. It relies on the algebraic system of 
equations solely, and not on the original PDE. The operators for coarse grids, as well as restriction 
and prolongation operators, are automatically defined when the coefficient matrix is given. It seems 
to be more robust than the classical multigrid method, as it solves non-symmetric problems (even 
hyperbolic or with closed characteristics) as quickly as classical multigrid solves the Poisson equation. 
Moreover, it is applicable to non-uniform grids as well, and does not require any special treatment 
of sub-domains. For anisotropic, discontinuous or indefinite problems MBF by itself is not always 
sufficient. However, it can cope with such problems successfully when it is applied in conjunction 
with vector extrapolation methods. In our numerical examples in the present work we have employed 
the Reduced Rank Extrapolation (RRE) of [17] and [18]. This and other related methods have been
surveyed in [19] and their analysis provided in [20], [21] and [22]. The numerical implementation that 
we have used is the one given in [23]. 

The MBF algorithm is described in Section 2. In Section 3 numerical results are presented. In 
Section 4 the algorithm and the numerical results are discussed. 


2 DESCRIPTION OF ALGORITHMS 


2.1 Definition of the TBF Method 


Let $A$ be an $N \times N$ matrix. Let $x$ and $b$ be $N$-dimensional vectors. Consider the problem

\[ Ax = b \qquad (1) \]

An iteration of the Two-Block Factorization (TBF) method is defined by

TBF($x_{in}$, $A$, $b$, $x_{out}$):

    $x_0 = x_{in}$
    $x_{i+1} = x_i - Z_i(Ax_i - b)$,   $0 \le i < i_1$
    $Qe = R(Ax_{i_1} - b)$   (2)
    $x_{i_1+1} = x_{i_1} - Pe$
    $x_{i+1} = x_i - Z_i(Ax_i - b)$,   $i_1 < i \le i_1 + i_2$
    $x_{out} = x_{i_1+i_2+1}$

where the $Z_i$ are some preconditioning operators, $i_1$ and $i_2$ are nonnegative integers denoting the
number of presmoothings and post-smoothings respectively and $R$, $P$ and $Q$ are operators to be
defined later. Define

\[ e_{in} = x_{in} - x, \qquad e_{out} = x_{out} - x. \]
Then

\[ e_{out} = \prod_{i=i_1+1}^{i_1+i_2} (I - Z_iA) \; (I - PQ^{-1}RA) \prod_{i=0}^{i_1-1} (I - Z_iA) \; e_{in}. \]

Consequently,

\[ Q = RAP \;\Rightarrow\; e_{out} = 0 \;\Rightarrow\; x_{out} = x. \qquad (3) \]


In the sequel, practical choices for Q will be considered. An iterative application of TBF is given 
by 

    $x_0 = 0$, $i = 0$
    while $\|\mathrm{residual}\| > \epsilon$
        TBF($x_i$, $A$, $b$, $x_{i+1}$)
        $i \leftarrow i + 1$
    endwhile
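The TBF iteration above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: matrices are dense lists of lists, system (2) is solved by a small Gaussian elimination, and the names `tbf`, `tbf_solve`, `jacobi` and `solve` are ours. A damped Jacobi smoother stands in for the operators $Z_i$, and the trivial choice $R = P = I$ is used only to keep the example short.

```python
def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def solve(Q, r):
    """Solve Q e = r by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(r)
    M = [list(Q[i]) + [r[i]] for i in range(n)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            m = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= m * M[k][j]
    e = [0.0] * n
    for i in reversed(range(n)):
        e[i] = (M[i][n] - sum(M[i][j] * e[j] for j in range(i + 1, n))) / M[i][i]
    return e

def residual(A, x, b):
    return [ax - bi for ax, bi in zip(matvec(A, x), b)]

def jacobi(A, omega=0.8):
    """Smoother x -> x - Z r with Z = omega * diag(A)^{-1} (an assumed choice of Z_i)."""
    return lambda x, r: [xi - omega * ri / A[i][i] for i, (xi, ri) in enumerate(zip(x, r))]

def tbf(x, A, b, R, P, Q, smooth, i1=1, i2=1):
    """One TBF iteration: i1 presmoothings, correction from Q e = R(Ax - b), i2 post-smoothings."""
    for _ in range(i1):
        x = smooth(x, residual(A, x, b))
    e = solve(Q, matvec(R, residual(A, x, b)))
    x = [xi - pe for xi, pe in zip(x, matvec(P, e))]
    for _ in range(i2):
        x = smooth(x, residual(A, x, b))
    return x

def tbf_solve(A, b, R, P, Q, smooth, eps=1e-10, max_iter=200):
    """Repeat TBF until the maximum norm of the residual drops below eps."""
    x, it = [0.0] * len(b), 0
    while max(abs(r) for r in residual(A, x, b)) > eps and it < max_iter:
        x = tbf(x, A, b, R, P, Q, smooth)
        it += 1
    return x, it

A = [[1.0, -0.5, 0.0], [-0.5, 1.0, -0.5], [0.0, -0.5, 1.0]]
b = [1.0, 0.0, 1.0]                      # the exact solution is (2, 2, 2)
I3 = [[float(i == j) for j in range(3)] for i in range(3)]
D = [[A[i][j] if i == j else 0.0 for j in range(3)] for i in range(3)]

x_direct = tbf([0.0] * 3, A, b, I3, I3, A, jacobi(A), i1=0, i2=0)  # Q = RAP: one step is exact
x_iter, n_iter = tbf_solve(A, b, I3, I3, D, jacobi(A))             # Q != RAP: iterate
```

With $Q = RAP$ a single call reproduces the direct solve of equation (3); with the cruder choice $Q = \mathrm{diag}(A)$ the method becomes a genuine iteration.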


2.2 Definition of the MBF Method 


The Multi-Block Factorization (MBF) method is a modification of the TBF method, in which the
system (2) is not solved directly, but is divided into several independent subsystems, which are solved
directly or recursively by MBF itself. For simplicity, we first write the algorithm for tridiagonal
systems. The operators $P$, $R$, $Q$ and $D$ will be defined later.


MBF($x_{in}$, $A$, $b$, $x_{out}$):
    if $A$ is diagonal
        $x_{out} = A^{-1}b$
    otherwise:
        $D = \mathrm{diag}(d_1, \dots, d_N)$
        MBF(0, $D^{-1}Q$, $D^{-1}R(Ax_{in} - b)$, $e$)
        $x_{out} = x_{in} - Pe$.


We turn now to the more general definition of MBF. First we note that if there exists a subset
of coordinates of $x$ which are independent of the others, then there exists a projection $\Pi$ onto the
subspace spanned by those coordinates such that $(\Pi A \Pi)\Pi x = \Pi b$. In the
MBF method such subsystems are solved directly, provided this can be done easily. The co-subsystem is
solved recursively.


MBF($x_{in}$, $A$, $b$, $x_{out}$):

1. If $A$ is diagonal, set $x_{out} = A^{-1}b$ and stop.

2. If $A$ includes an independent tridiagonal subsystem $(\Pi A \Pi)\Pi x = \Pi b$, solve it directly: $\Pi x_{out} =
(\Pi A \Pi)^{-1}\Pi b$. If not, set $\Pi = 0$.

3.

    $y_0 = (I - \Pi)x_{in}$
    $b \leftarrow (I - \Pi)b$
    $y_{i+1} = y_i - (I - \Pi)Z_i(Ay_i - b)$,   $0 \le i < i_1$
    $D = \mathrm{diag}(d_1, \dots, d_N)$
    MBF(0, $D^{-1}Q$, $D^{-1}R(Ay_{i_1} - b)$, $e$)   (4)
    $y_{i_1+1} = y_{i_1} - (I - \Pi)Pe$
    $y_{i+1} = y_i - (I - \Pi)Z_i(Ay_i - b)$,   $i_1 < i \le i_1 + i_2$
    $(I - \Pi)x_{out} = y_{i_1+i_2+1}$.

Trivially, one can replace action (4) by variant a:

    MBF(0, $QD^{-1}$, $R(Ay_{i_1} - b)$, $e$)
    $e \leftarrow D^{-1}e$

or variant b:

    MBF(0, $D^{-1/2}QD^{-1/2}$, $D^{-1/2}R(Ay_{i_1} - b)$, $e$)
    $e \leftarrow D^{-1/2}e$.

An iterative application of MBF is given by 

    $x_0 = 0$, $i = 0$
    while $\|\mathrm{residual}\| > \epsilon$
        MBF($x_i$, $A$, $b$, $x_{i+1}$)
        $i \leftarrow i + 1$
    endwhile


2.3 The Tridiagonal Case 


Let $I$ denote an identity operator. Suppose $N = 2^n$ for some positive integer $n$, and let $A$ be a
tridiagonal M-matrix satisfying $\mathrm{diag}(A) = I$. Let $M(N)$ be the permutation matrix which reorders
the variables of $N$-dimensional vectors such that odd numbered variables appear in a first block and
even numbered variables appear in a second block. Define

\[ M_0 = M(N), \qquad A_0 = A. \]

Then for some bidiagonal matrices $B_0$ and $C_0$ we have

\[ A_0 = M_0^T \begin{pmatrix} I & B_0 \\ C_0 & I \end{pmatrix} M_0 = R_{A,1}^{-1} Q_{A,1} P_{A,1}^{-1}, \]

where

\[ R_{A,1} = \begin{pmatrix} I & 0 \\ -C_0 & I \end{pmatrix} M_0, \qquad Q_{A,1} = \begin{pmatrix} I & 0 \\ 0 & I - C_0B_0 \end{pmatrix}, \qquad P_{A,1} = M_0^T \begin{pmatrix} I & -B_0 \\ 0 & I \end{pmatrix}. \]

Note that $Q_{A,1}$ is the Schur complement for $A$.
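The factorization can be verified by a direct block computation. The following short check uses only the definitions of $R_{A,1}$, $Q_{A,1}$ and $P_{A,1}$ from this section, with the odd-numbered variables ordered first; the permutations cancel because $M_0 M_0^T = I$:

```latex
\[
R_{A,1} A_0 P_{A,1}
= \begin{pmatrix} I & 0 \\ -C_0 & I \end{pmatrix}
  \begin{pmatrix} I & B_0 \\ C_0 & I \end{pmatrix}
  \begin{pmatrix} I & -B_0 \\ 0 & I \end{pmatrix}
= \begin{pmatrix} I & B_0 \\ 0 & I - C_0 B_0 \end{pmatrix}
  \begin{pmatrix} I & -B_0 \\ 0 & I \end{pmatrix}
= \begin{pmatrix} I & 0 \\ 0 & I - C_0 B_0 \end{pmatrix}
= Q_{A,1}.
\]
```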

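For $i = 1, 2, \dots$, let $I_i$ denote an identity operator of order $N - 2^{n-i}$. Let $M_i$ be the $N \times N$
permutation matrix that reorders the coordinates $j = N - 2^{n-i} + 1, \dots, N$ of an $N$-dimensional
vector in the above manner, that is, order odd coordinates in a first block, then even coordinates in
a second block. In fact,

\[ M_i = \begin{pmatrix} I_i & 0 \\ 0 & M(2^{n-i}) \end{pmatrix}. \]

For $1 \le i < n$, define

\[ P_{A,i+1} = M_i^T \begin{pmatrix} I_i & 0 & 0 \\ 0 & I & -B_i \\ 0 & 0 & I \end{pmatrix}, \qquad
R_{A,i+1} = \begin{pmatrix} I_i & 0 & 0 \\ 0 & I & 0 \\ 0 & -C_i & I \end{pmatrix} M_i, \]

\[ Q_{A,i+1} = R_{A,i+1} A_i P_{A,i+1} = \begin{pmatrix} I_i & 0 & 0 \\ 0 & I & 0 \\ 0 & 0 & I - C_iB_i \end{pmatrix}, \]

\[ D_{A,i+1} = \mathrm{diag}(Q_{A,i+1}), \]

\[ A_{i+1} = D_{A,i+1}^{-1} Q_{A,i+1} = M_{i+1}^T \begin{pmatrix} I_{i+1} & 0 & 0 \\ 0 & I & B_{i+1} \\ 0 & C_{i+1} & I \end{pmatrix} M_{i+1}, \qquad (i < n - 1). \]

The last equality sign implicitly defines $B_{i+1}$ and $C_{i+1}$. For variants a and b, the above definition is
modified to read

\[ A_{i+1} = Q_{A,i+1} D_{A,i+1}^{-1} \]

and

\[ A_{i+1} = D_{A,i+1}^{-1/2} Q_{A,i+1} D_{A,i+1}^{-1/2}, \]

respectively.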

Lemma 1 For all variants, the matrices $Q_{A,i}$ and $A_i$ are tridiagonal M-matrices.

Proof: The lemma follows from the definition by induction on $i$. □

Lemma 2 The TBF method, when applied with

\[ Q = Q_{A,1}, \quad P = P_{A,1}, \quad R = R_{A,1}, \quad i_1 = 0, \quad i_2 = 0 \]

is a direct method.

Proof: Since

\[ Q = Q_{A,1} = R_{A,1} A_0 P_{A,1} = RAP, \]

the lemma follows from equation (3). □

The even numbered variables may be viewed as coarse-grid points. Then $Q$ is a coarse grid
operator, $R$ is a fine-to-coarse restriction and $P$ is a coarse-to-fine prolongation.

Lemma 3 The MBF method applied with the operators

\[ A = A_{i-1}, \quad Q = Q_{A,i}, \quad D = D_{A,i}, \quad P = P_{A,i}, \quad R = R_{A,i} \]

on the $i$th call to the MBF procedure, is a direct method.

Proof: Note that on the $(n+1)$st call to the MBF procedure, the matrix $A_n$ is diagonal, so
the MBF procedure is a direct solve. By induction on $i = n, \dots, 1$, all calls to MBF are equivalent
to calls to TBF, hence are direct solves. □

In fact, in the tridiagonal case the MBF method is equivalent to the cyclic reduction method. 
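For readers who want to see the equivalence concretely, here is a minimal sketch of recursive cyclic reduction for a tridiagonal system, written in Python. It is an illustration under our own conventions, not code from this work: the function name `cyclic_reduction` is ours, the sub-, main and super-diagonals are stored in lists `a`, `d`, `c` with `a[0]` and `c[-1]` assumed to be zero, and no pivoting is done (which is reasonable for the diagonally dominant M-matrices considered here). The even-numbered variables (1-based) play the role of the coarse grid.

```python
def cyclic_reduction(a, d, c, f):
    """Solve a tridiagonal system by eliminating the odd-numbered (1-based) unknowns recursively."""
    n = len(d)
    if n == 1:                           # a 1x1 (diagonal) system is solved directly
        return [f[0] / d[0]]
    coarse = list(range(1, n, 2))        # 0-based indices of the even-numbered variables
    ca, cd, cc, cf = [], [], [], []
    for k in coarse:
        has_right = k + 1 < n
        lo = a[k] / d[k - 1]
        hi = c[k] / d[k + 1] if has_right else 0.0
        ca.append(-lo * a[k - 1])
        cd.append(d[k] - lo * c[k - 1] - (hi * a[k + 1] if has_right else 0.0))
        cc.append(-hi * c[k + 1] if has_right else 0.0)
        cf.append(f[k] - lo * f[k - 1] - (hi * f[k + 1] if has_right else 0.0))
    xc = cyclic_reduction(ca, cd, cc, cf)    # the coarse system is again tridiagonal
    x = [0.0] * n
    for idx, k in enumerate(coarse):
        x[k] = xc[idx]
    for j in range(0, n, 2):                 # back-substitute for the eliminated unknowns
        left = a[j] * x[j - 1] if j > 0 else 0.0
        right = c[j] * x[j + 1] if j + 1 < n else 0.0
        x[j] = (f[j] - left - right) / d[j]
    return x

def tridiag_matvec(a, d, c, x):
    n = len(d)
    return [(a[j] * x[j - 1] if j > 0 else 0.0) + d[j] * x[j]
            + (c[j] * x[j + 1] if j + 1 < n else 0.0) for j in range(n)]

n = 8
a = [0.0] + [-0.5] * (n - 1)
d = [1.0] * n
c = [-0.5] * (n - 1) + [0.0]
x_true = [float(j + 1) for j in range(n)]
f = tridiag_matvec(a, d, c, x_true)
x = cyclic_reduction(a, d, c, f)
```

The sketch does not rescale the coarse system to unit diagonal, whereas the MBF definition above does; that changes the intermediate matrices but not the computed solution.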
Note that if the matrices $P$ and $R$ are defined to be the rectangular matrices

\[ P_{A,i+1} = M_i^T \begin{pmatrix} -B_i \\ I \end{pmatrix}, \qquad R_{A,i+1} = \begin{pmatrix} -C_i & I \end{pmatrix} M_i, \]

\[ Q_{A,i+1} = R_{A,i+1} A_i P_{A,i+1} = I - C_iB_i, \]

\[ D_{A,i+1} = \mathrm{diag}(Q_{A,i+1}), \]

\[ A_{i+1} = D_{A,i+1}^{-1} Q_{A,i+1} = M_{i+1}^T \begin{pmatrix} I & B_{i+1} \\ C_{i+1} & I \end{pmatrix} M_{i+1}, \]

the algorithm will still be applicable. As a matter of fact, the only difference between this method
and the former is that in the present method we are not taking advantage of the known residuals on
the odd numbered variables when making the coarse grid correction. Hence, if those residuals are
zero, which may happen as a result of red-black presmoothing, the present method serves as a direct
method, just as the former.
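The remark about red-black presmoothing can be checked on a small example. The Python sketch below is our own illustration (the helper names `relax` and `tridiag_residual` are hypothetical): for a tridiagonal matrix each odd-numbered variable is coupled only to even-numbered neighbours, so if the odd-numbered variables are relaxed last, their residuals vanish.

```python
def tridiag_residual(a, d, c, x, f):
    """Residual Ax - f for a tridiagonal A stored by diagonals (a: sub, d: main, c: super)."""
    n = len(d)
    r = []
    for j in range(n):
        s = d[j] * x[j] - f[j]
        if j > 0:
            s += a[j] * x[j - 1]
        if j + 1 < n:
            s += c[j] * x[j + 1]
        r.append(s)
    return r

def relax(a, d, c, x, f, points):
    """Gauss-Seidel update restricted to the given points (modifies x in place)."""
    n = len(d)
    for j in points:
        s = f[j]
        if j > 0:
            s -= a[j] * x[j - 1]
        if j + 1 < n:
            s -= c[j] * x[j + 1]
        x[j] = s / d[j]

n = 8
a = [0.0] + [-0.5] * (n - 1)
d = [1.0] * n
c = [-0.5] * (n - 1) + [0.0]
f = [1.0] * n
x = [0.0] * n

relax(a, d, c, x, f, range(1, n, 2))   # even-numbered variables (1-based) first
relax(a, d, c, x, f, range(0, n, 2))   # odd-numbered variables (1-based) last
r = tridiag_residual(a, d, c, x, f)
odd_residuals = [r[j] for j in range(0, n, 2)]
even_residuals = [r[j] for j in range(1, n, 2)]
```

After this sweep only the even-numbered (coarse) variables carry a residual, which is exactly the situation in which the rectangular operators lose nothing.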


2.4 The Separable 2-Dimensional Case 


Let $S = (s_{i,j})$ and $T = (t_{i,j})$ be matrices of order $M$ and $N$ respectively. Define

\[ S \circ T = \begin{pmatrix} s_{1,1}T & \cdots & s_{1,M}T \\ \vdots & & \vdots \\ s_{M,1}T & \cdots & s_{M,M}T \end{pmatrix}. \]

Actually, $\circ$ denotes the tensor product. Suppose $A$ is of the form

\[ A = T \circ E_{S,0} + E_{T,0} \circ S \]

where $T$ and $S$ are tridiagonal scaled M-matrices and $E_{S,0}$ and $E_{T,0}$ are diagonal matrices. For
example, if

\[ T = S = \mathrm{tridiag}(-1/2, 1, -1/2), \qquad E_{T,0} = E_{S,0} = I \]

then $A$ represents a central discretization of the Poisson equation on a square with Dirichlet boundary
conditions.
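The Poisson example can be assembled explicitly for a tiny grid. The following Python sketch is only an illustration (the helpers `kron`, `tridiag`, `eye` and `add` are ours); it builds $A$ for $N = 3$ and inspects the row belonging to the centre point.

```python
def kron(S, T):
    """Tensor product S o T: block (i, j) of the result is S[i][j] * T."""
    m, n = len(S), len(T)
    return [[S[i][j] * T[k][l] for j in range(m) for l in range(n)]
            for i in range(m) for k in range(n)]

def tridiag(n, lo, mid, hi):
    return [[mid if i == j else lo if j == i - 1 else hi if j == i + 1 else 0.0
             for j in range(n)] for i in range(n)]

def eye(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

N = 3
T = S = tridiag(N, -0.5, 1.0, -0.5)
E = eye(N)
A = add(kron(T, E), kron(E, S))   # A = T o E_S + E_T o S with E_S = E_T = I
center = A[4]                     # row of the centre point of the 3 x 3 grid
```

The centre row has diagonal entry 2 and four off-diagonal entries equal to -1/2, i.e. the usual five-point stencil scaled by 1/2, so its entries sum to zero.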

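Suppose $T$ and $S$ have the same order $N$, which is a power of 2. As in the previous section, we define
the matrices $T_i$, $S_i$, $R_{T,i}$, $R_{S,i}$, $P_{T,i}$, $P_{S,i}$, $Q_{T,i}$, $Q_{S,i}$, $D_{T,i}$ and $D_{S,i}$. For any matrix $B = (b_{i,j})_{1 \le i,j \le N}$ define

\[ \mathrm{rowsum}(B) = \mathrm{diag}\Big(\sum_{j=1}^{N} b_{i,j}\Big)_{1 \le i \le N}. \]

For $0 \le i < n$, define

\[
\begin{aligned}
E_T &= \mathrm{rowsum}(R_{T,i+1}) \, E_{T,i} \, \mathrm{rowsum}(P_{T,i+1}) \\
E_S &= \mathrm{rowsum}(R_{S,i+1}) \, E_{S,i} \, \mathrm{rowsum}(P_{S,i+1}) \\
P_{A,i+1} &= P_{T,i+1} \circ P_{S,i+1} \\
R_{A,i+1} &= R_{T,i+1} \circ R_{S,i+1} \\
D_{A,i+1} &= D_{T,i+1} \circ D_{S,i+1} \\
E_{T,i+1} &= D_{T,i+1}^{-1} E_T \\
E_{S,i+1} &= D_{S,i+1}^{-1} E_S \\
Q_{A,i+1} &= R_{T,i+1} T_i P_{T,i+1} \circ E_S + E_T \circ R_{S,i+1} S_i P_{S,i+1} \\
A_{i+1} &= D_{A,i+1}^{-1} Q_{A,i+1} \\
        &= T_{i+1} \circ E_{S,i+1} + E_{T,i+1} \circ S_{i+1}.
\end{aligned}
\]

In the definition of TBF, we take

\[ Q = Q_{A,1}, \quad P = P_{A,1}, \quad R = R_{A,1}. \]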

Note that if we would eliminate the word rowsum in the above definition, the TBF method would
be direct, due to equation (3). Nevertheless, for smooth vectors, multiplication by a positive matrix
$B$ is well-approximated by multiplication by rowsum($B$). Since $T_i$ is an M-matrix, $R_{T,i+1}$ and $P_{T,i+1}$
are positive. Consequently, if we use presmoothing and post-smoothing, i.e. $i_1 > 0$ and $i_2 > 0$,
then the error is smooth, so equation (2) would give a good corrector for the current approximation.
Moreover, the use of the above row-sum approximation makes the system (2) much easier to solve
than the original system, since it includes 4 independent subsystems:

1. A diagonal system connecting variables which are odd in both directions, i.e., variables that
correspond to odd rows of both S and T (fine grid system).

2. A tridiagonal system connecting variables which are odd in the first direction and even in the 
second, i.e., variables that correspond to odd rows of T and even rows of S (half coarse system). 

3. A tridiagonal system connecting variables which are even in the first direction and odd in the 
second, i.e., variables that correspond to even rows of T and odd rows of S (half coarse system). 

4. A penta-diagonal system connecting variables which are even in both directions, i.e., variables
that correspond to even rows of both S and T (coarse grid system).
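The row-sum approximation that produces this decoupling can be checked numerically. The Python sketch below is our own illustration with an arbitrarily chosen positive tridiagonal matrix $B$ (not one of the operators of this paper): it compares $Bv$ with rowsum($B$)$v$ for a smooth and for a highly oscillatory vector.

```python
import math

def tridiag(n, lo, mid, hi):
    return [[mid if i == j else lo if j == i - 1 else hi if j == i + 1 else 0.0
             for j in range(n)] for i in range(n)]

def matvec(B, v):
    return [sum(bij * vj for bij, vj in zip(row, v)) for row in B]

def rowsum_apply(B, v):
    """Multiply v by the diagonal matrix rowsum(B)."""
    return [sum(row) * vi for row, vi in zip(B, v)]

def relative_difference(B, v):
    exact, approx = matvec(B, v), rowsum_apply(B, v)
    num = math.sqrt(sum((e - p) ** 2 for e, p in zip(exact, approx)))
    den = math.sqrt(sum(p ** 2 for p in approx))
    return num / den

n = 32
B = tridiag(n, 0.5, 1.0, 0.5)                                   # a positive matrix (assumed example)
smooth = [math.sin(math.pi * (j + 1) / (n + 1)) for j in range(n)]
rough = [(-1.0) ** j for j in range(n)]

err_smooth = relative_difference(B, smooth)   # small: the approximation is good
err_rough = relative_difference(B, rough)     # close to 1: the approximation fails
```

This is why the smoothing steps matter: the coarse correction is only trusted on the smooth part of the error.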

Only the solution of the last subsystem is expensive. The MBF method solves this subsystem
recursively by the same procedure. On the $i$th call to MBF ($0 < i \le n$) the operators used are

\[ A = A_{i-1}, \quad Q = Q_{A,i}, \quad D = D_{A,i}, \quad P = P_{A,i} \quad \text{and} \quad R = R_{A,i}. \]

Since $A_n$ is diagonal, the method is well-defined. The total work of MBF for a problem with $N^2$
variables is

\[ w(N^2) = O(N^2) + w(N^2/4) = O(N^2) + O(N^2/4) + w(N^2/16) = \cdots = O(N^2). \]
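The recurrence can be summed explicitly. Assuming (our notation, not the paper's) that the non-recursive work on a level with $m$ unknowns is at most $cm$ for some constant $c$, the geometric series gives:

```latex
\[
w(N^2) \le cN^2 \left( 1 + \frac{1}{4} + \frac{1}{16} + \cdots \right) < \frac{4}{3}\, cN^2 .
\]
```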

Note that if we had

\[ (\mathrm{rowsum}(R_{T,i+1}) \cdot \mathrm{rowsum}(P_{T,i+1})) \circ I = I \circ D_{S,i+1}^{-1} \]
\[ I \circ (\mathrm{rowsum}(R_{S,i+1}) \cdot \mathrm{rowsum}(P_{S,i+1})) = D_{T,i+1}^{-1} \circ I \]

then

\[ Q_{A,i+1} = D_{T,i+1}T_{i+1} \circ E_S + E_T \circ D_{S,i+1}S_{i+1} = T_{i+1} \circ E_{S,i} + E_{T,i} \circ S_{i+1} \]

is already in the scaled form, on which the MBF algorithm may act recursively. Hence the scaling
by $D^{-1}$ in action (4) of Section 2.2 is not needed. Of course, these equalities cannot hold exactly, but
if they hold approximately, we can avoid scaling. Especially in the non-separable case, where scaling
is impossible, action (4) has to take place without scaling by $D^{-1}$ (instead of actual scaling, we would
prefer in that case to keep diagonal matrices multiplying the difference operators in each of the space
directions). Hence we would like to assume the above equalities at least for all former levels, that is,
at the $i$th call to MBF, for all $0 \le j \le i - 2$. We call that variant “noscal”.

If we use the rectangular matrices P and R of the last part of Section 2.3, we get a variant of 
MBF which we call variant notri. In this variant, only the last of the four subsystems described 
above is solved. It assumes that the other tridiagonal subsystems affect only the smoothness of the 
error. The row-sum operation, however, is still performed on the original triangular matrices and not 
on the newly defined rectangular matrices. 

Instead of the operators $A_i$ and $Q_{A,i}$ defined above, one may use difference operators which arise
from the original PDE. If the algorithm is to be automatic, all such operators have to be of the 
same type (i.e. central or upwind) as the original fine-grid operator. (Nevertheless, in Section 3 
we will see that for some non-symmetric problems this condition has to be violated for the sake of 
convergence.) We use this strategy with the rectangular matrices of variant notri ; our version is then 
different from classical multigrid only in the choice of restriction and prolongation operators. Note 
that the row-sums computed in the MBF algorithm are usually 4. Instead of the multiplication by 
these row-sums, one may divide the residual by 4 before action (4) of Section 2.2. Then one gets an 
algorithm which is equivalent to that of [12] for the Poisson equation, and is close to that of [5] for 
other problems. We denote that strategy MGF (Multigrid + Factorization). It should be kept in
mind that when applying this strategy one must use $2^n - 1$ grid points on the finest grid and $2^q - 1$,
$1 \le q < n$, for coarser grids in order to conserve uniformity. Here the even points, which are taken as
coarse grid points, are always internal points of the original stencil. For $2^q$ point grids, on the other
hand, the last fine grid point appears as a last grid point in all grids. Hence, coarse grids are biased
towards the boundary. For our method MBF, on the other hand, stencils of both $2^n$ points or $2^n - 1$
points may be used. This is critical for implementation to problems on general regions, where grid 
lines may contain variable numbers of grid points (see Section 4). 

As mentioned above, the MGF method requires division of residuals by 4 before action (4) takes 
place. Sometimes it is better to scale the discrete operators on all grids instead of dividing the 
residuals by this factor. Actually, for the Poisson equation both manners are equivalent: suppose $A_1$
has a coarse step-size $H = 2h$; then normalizing $A_1$ to have the same diagonal entries as $A_0 = A$
amounts to multiplication of $A_1$ by the factor 4, which is equivalent to the division of the residual
in action (4) by that factor. Nevertheless, for differential equations that include derivatives of orders 
other than 2, this variant is not equivalent to MGF. We call it MGN (Multigrid + Normalization). 

The generalization of the MBF method and of the other multigrid versions to nonseparable 
problems is straightforward. A tensor product by an $N \times N$ diagonal matrix is to be replaced with
a multiplication by an $N^2 \times N^2$ diagonal matrix.

Another generalization of MBF is to non-rectangular domains. This is also straightforward, since 
a line containing an odd number of points may be divided into two sets, one containing odd points 
and the other containing even points. Then one of those sets is considered as a coarse grid, and is 
divided recursively in the same way. A similar strategy may be used in the other space direction. 
3 NUMERICAL EXPERIMENTS


In this section, the MBF method is compared to other multigrid versions. The problems solved 
are of the type 

\[ Lu(x,y) = f(x,y), \qquad (x,y) \in (0,1)^2, \]
\[ u(x,y) = xy, \qquad (x,y) \in \partial[0,1]^2. \]

The equations are discretized via a 3-point central difference scheme. For MBF, the number of grid
points in each space direction is $N = 2^n$. For other multigrid strategies, however, the number of
grid points in each space direction is $N - 1 = 2^n - 1$; otherwise, coarse grids are biased towards the
boundary (see Section 2.4), and the convergence is slow.

For MBF, we have used variant “notri” of Section 2.4, in which tridiagonal subsystems are not
solved. The main variant, which involves the solution of those subsystems, was found to be at most 
as effective as “notri”. 

The main variant of Section 2.4 was used for the hyperbolic, the non-uniform, the strongly in- 
definite and the discontinuous problems (the latter when applied with a red-black smoother). For 
other problems we have found that variant “noscal” described there (in which scaling of coarse-grid 
operators is omitted), performs equally well. Hence we have chosen to use this simpler version rather
than the main variant. Indeed, it was found that for most problems its performance was very close 
to that of the main variant. For the hyperbolic problem, however, its performance was twice as slow. 

The smoother of the error in all grids was the one provided by the ILU(1,1) iteration of [24]
[25] [26] or the red-black (RB) iteration. This determines the operators $Z_i$ of Section 2.2 to be the
preconditioners for the ILU(1,1) or RB iteration, respectively. These smoothers were found to be
superior to the Jacobi and damped-Jacobi smoothers. One presmoothing and one post-smoothing is 
performed. The initial guess is random. Double precision arithmetic is used. 

The integers in the following tables present the number of iterations needed to reduce the $l_2$ norm
of the residual by a factor of $10^6$. The maximum norm of the error was also computed, and its rate of convergence
was close to that of the residual. 
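As a rough guide to reading the tables, the iteration counts can be converted into average convergence factors. This is a back-of-the-envelope conversion that assumes the residual norm decreases by a constant factor $\rho$ per iteration, which the experiments themselves do not claim:

```latex
\[
\rho^k = 10^{-6} \quad\Longrightarrow\quad \rho = 10^{-6/k},
\qquad\text{e.g.}\quad
k = 4:\ \rho \approx 0.032, \qquad
k = 7:\ \rho \approx 0.14, \qquad
k = 28:\ \rho \approx 0.61 .
\]
```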

In conjunction with the MBF iteration, we have used the computer code of [23] that implements 
the vector acceleration RRE that was mentioned in the introduction. The RRE acceleration was 
employed in cycling mode, by restarting it after every 10 iterations until convergence. The results of 
this are compared to those provided by the MBF iteration without acceleration denoted by NONE. 
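To make the role of the acceleration concrete, here is a minimal Python sketch of one RRE extrapolation step. It is our own illustration and not the code of [23]: the function names `rre` and `solve` are hypothetical, the least-squares problem is solved through the normal equations for brevity (a production implementation would typically prefer an orthogonal factorization), and the underlying iteration is a small linear fixed-point map chosen only for the test. Given iterates $x_0, \dots, x_{k+1}$ with differences $u_i = x_{i+1} - x_i$ and $w_i = u_{i+1} - u_i$, the sketch computes $s = x_0 + \sum_i \xi_i u_i$, where $\xi$ minimizes $\|u_0 + \sum_i \xi_i w_i\|_2$.

```python
def solve(G, r):
    """Solve G z = r by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(r)
    M = [list(G[i]) + [r[i]] for i in range(n)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            m = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= m * M[k][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z

def rre(xs):
    """One RRE extrapolation from the iterates xs = [x_0, ..., x_{k+1}]."""
    k, dim = len(xs) - 2, len(xs[0])
    u = [[xs[i + 1][m] - xs[i][m] for m in range(dim)] for i in range(k + 1)]
    w = [[u[i + 1][m] - u[i][m] for m in range(dim)] for i in range(k)]
    dot = lambda p, q: sum(pi * qi for pi, qi in zip(p, q))
    G = [[dot(w[i], w[j]) for j in range(k)] for i in range(k)]   # normal equations
    rhs = [-dot(w[i], u[0]) for i in range(k)]
    xi = solve(G, rhs)
    return [xs[0][m] + sum(xi[i] * u[i][m] for i in range(k)) for m in range(dim)]

# Assumed test iteration x -> Mx + c in two dimensions; its fixed point is (10/3, 10/3).
M = [[0.5, 0.2], [0.1, 0.3]]
c = [1.0, 2.0]
step = lambda x: [sum(M[i][j] * x[j] for j in range(2)) + c[i] for i in range(2)]

xs = [[0.0, 0.0]]
for _ in range(3):
    xs.append(step(xs[-1]))
s = rre(xs)
defect = [a - b for a, b in zip(step(s), s)]
```

For this two-dimensional linear iteration, four iterates already determine the fixed point exactly. In the cycling mode described above, the extrapolated vector would be used as the starting point of the next batch of iterations.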

We have also examined the classical multigrid versions mentioned at the end of Section 2.4. This 
strategy is denoted by the superscript D . The number of grid points in each space direction is 
$N - 1 = 2^n - 1$. In most of the problems, the MGF and MGN versions of Section 2.4 are equivalent.
Where this is not the case, we mention explicitly which of the two versions has been employed. 

For comparison, we also checked the performance of a method which does not involve any multigrid 
strategy. This method is the Modified ILU method of [29] with the optimal parameter of [30], used 
as a preconditioner for RRE. It is denoted by MILU. 

In the following tables, when a method converges very slowly we denote it by “slow” , and when 
a method diverges, we denote it by “div”. 


3.1 The Poisson Equation 


In table 1 we present the results for the Poisson equation. 
         RRE   NONE   RRE   NONE   RRE     NONE    RRE    NONE   RRE
  N      ILU   ILU    RB    RB     ILU^D   ILU^D   RB^D   RB^D   MILU
  32     4     5      5     7      4       5       5      7      19
  64     4     5      5     7      4       5       5      7      27
  128    4     5      5     7      4       5       5      7      42

Table 1: Results for the Poisson equation

MBF and the classical multigrid version perform equally well for this problem.


3.2 Poisson Equation on a Tchebycheff-Type Grid 

In table 2 we present the results for the Poisson equation, discretized via central differences on 
the 2-dimensional grid 

P (j.fe) s ( , 1 ~ C0 2 S <^ ) i<y,Kjv 

The matrix operator for this scheme may be used as a preconditioner for a Tchebycheff-collocation 
discretization of the Poisson equation (see [31] and the references therein). 



         RRE   NONE   RRE   NONE   RRE     NONE    RRE    NONE   RRE
  N      ILU   ILU    RB    RB     ILU^D   ILU^D   RB^D   RB^D   MILU
  32     4     5      7     10     4       5       8      11     16
  64     4     6      9     19     5       6       10     18     23
  128    5     7      13    33     5       6       14     28     36

Table 2: The Poisson equation with non-uniform grid


The superscript D refers to the MGF method of the end of Section 2.4, which is in the spirit of 
Dendy [12]. It performs equally well as MBF.


3.3 An Anisotropic Discontinuous Equation 


In table 3 we present the results for an anisotropic equation whose coefficients are discontinuous; 


\[ a(x)u_{xx} + a(y)u_{yy} = 0. \]

Here $a(t)$ is defined by

\[ a(t) = \begin{cases} 0.01, & 0 \le t < 0.5, \\ 1, & 0.5 \le t \le 1. \end{cases} \]

MBF and MGF perform equally well for this problem. For both methods, $N - 1$ grid points were
used in each space direction. If $N = 2^n$ grid points are used in any of the space directions, then
for coarse-grid problems the discontinuity lines are biased towards the boundary, and convergence 
becomes slow. 


Results similar to those of table 3 were obtained for the continuous anisotropic problem

\[ u_{xx} + 0.01u_{yy} = 0. \]

This time, however, there was no difference in convergence rate when the number of grid points in
each space direction was changed from $N = 2^n$ to $N - 1 = 2^n - 1$, for both MBF and MGF.

         RRE   NONE   RRE   NONE   RRE     NONE    RRE    NONE   RRE
  N      ILU   ILU    RB    RB     ILU^D   ILU^D   RB^D   RB^D   MILU
  32     5     6      19    slow   5       6       19     slow   23
  64     7     10     26    slow   7       10      26     slow   32
  128    9     13     28    slow   9       13      28     slow   48

Table 3: An anisotropic discontinuous equation


3.4 A Convection-Diffusion Equation with Circular Streamlines 

In table 4 we present the results for the convection-diffusion equation 

\[ u_{xx} + u_{yy} + 150\,((y - 0.5)u_x - (x - 0.5)u_y) = f \]

whose characteristics are circles:



         RRE   NONE   RRE   NONE   RRE     NONE    RRE    NONE   RRE
  N      ILU   ILU    RB    RB     ILU^D   ILU^D   RB^D   RB^D   MILU
  32     7     10     12    15     8       slow    div    div    28
  64     6     10     9     12     8       slow    div    div    51
  128    6     10     9     12     8       slow    div    div    94

Table 4: A convection-diffusion equation with circular streamlines


Problems of the last type are widely discussed in [6]. The approach developed there requires 
special treatments and is not as automatic as ours. 


3.5 A Convection-Diffusion Equation with Radial Streamlines 


In table 5 we present the results for the convection-diffusion equation 

\[ u_{xx} + u_{yy} + 150\,(xu_x + yu_y) = f \]

whose characteristics are radial lines:



         RRE   NONE   RRE   NONE   RRE     NONE    RRE    NONE   RRE
  N      ILU   ILU    RB    RB     ILU^D   ILU^D   RB^D   RB^D   MILU
  32     7     9      slow  div    7       9       div    div    28
  64     5     6      13    12     7       8       15     15     51
  128    5     6      9     10     8       9       12     15     94

Table 5: A convection-diffusion equation with radial streamlines


The superscript D denotes here the MGF method of the end of Section 2.4. Nevertheless, the
purely automatic MGF version, in which all difference operators are central, diverged. To avoid 
that we had to use upwind difference schemes for all grids coarser than the original grid. This 
strategy, however, though performing almost equally well as MBF , suffers the disadvantage of not 
being automatic. Another way to overcome divergence is to use the MGN method of the end of 
Section 2.4, with the same step-size h for all grids. This strategy is non-automatic as well, and about 
twice as slow as the first one. 
3.6 A Convection Equation

In table 6 we present the results for the convection equation 

\[ (y - 0.5)u_x - (x - 0.5)u_y = f \]

whose characteristics are circles, discretized via an upwind scheme: 



         RRE   NONE   RRE     NONE
  N      ILU   ILU    ILU^D   ILU^D
  32     10    27     10      slow
  64     15    49     14      slow
  128    23    89     23      slow

Table 6: A convection equation with circular streamlines

The superscript D refers to the MGF method of the end of Section 2.4. Its performance is equal 
to that of MBF. 

The standard ILU and the Modified ILU of [29] (on the fine grid only, without multigrid strategy) 
do not converge for this problem. All multigrid strategies with an RB smoother are very slow. 


3.7 The Helmholtz Equation 


In table 7 we present the results for the Helmholtz equation 

\[ u_{xx} + u_{yy} + \beta u = f \]

with $\beta = 64$. The RRE method for MBF was restarted in this example after every 5 iterations.



         RRE   RRE   RRE     RRE
  N      ILU   RB    ILU^D   RB^D
  32     9     14    17      16
  64     9     13    17      19
  128    8     15    18      20

Table 7: The Helmholtz equation

Without acceleration, all methods diverged. The RRE acceleration for ILU and MILU iterations 
(on the fine grid only, without multigrid strategy) also diverged. 

The superscript D denotes here the MGN version, used with a continuation strategy; that is, use
a parameter $\beta$ smaller than that of the original PDE for grids coarser than the original grid, in such
a way that the number $h^2\beta$ is constant for all grids. Without this continuation strategy, divergence
was reported. Consequently, it suffers the disadvantage of not being automatic.
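As a small worked example of this continuation rule, assume that each coarsening doubles the step size ($H = 2h$), as in the grids used here. Keeping $h^2\beta$ constant then means:

```latex
\[
H^2 \beta_H = h^2 \beta_h \quad\Longrightarrow\quad \beta_H = \frac{\beta_h}{4},
\]
```

so that for $\beta = 64$ on the finest grid the successively coarser grids would use $\beta = 16, 4, 1, \dots$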

This problem is of the type of problems discussed in [7]. The projection approach given there 
requires more work and special treatment. 

For $\beta > 64$, the RRE acceleration for MBF seems to suffer stability problems, as the residual
no longer decreases monotonically. A machine with higher precision (or Kaczmarz smoother as in
Section 3.8) is required. With classical multigrid, on the other hand, the acceleration is more stable:
with the ILU smoother, it converges for $\beta = 100$ and $N = 64$ in 42 iterations.

Note that the Helmholtz equation and the convection-diffusion equations are better-posed as the
number of grid points increases; hence the number of iterations generally decreases.
3.8 Helmholtz Equation with Mixed Boundary Conditions


The above experiments involve Dirichlet boundary conditions. In this sub-section we examine the 
Helmholtz equation with Dirichlet boundary conditions on the edges $x = 0$, $x = 1$ and $y = 1$, and
with the mixed boundary conditions

\[ \frac{\partial u}{\partial n} + \alpha u = g \qquad (5) \]

on the edge y = 0. 

We have repeated experiments 6.1 and 6.3 of [32], for which $\beta = 100$, $\alpha = 100i$, $N = 31$ and
$\beta = 200$, $\alpha = 10i$, $N = 63$ respectively. On coarse grids, where the problem is very indefinite, we
have used Kaczmarz relaxations as a smoother. The cost of an MBF or MGF iteration was about
10 Jacobi iterations of the original problem. With MBF , we have converged in 10 iterations for the 
first problem and in 18 for the second one, which is much better than the results of [32] . With MGF, 
applied with the continuation strategy described in Section 3.7, we have converged in 23 iterations 
for the first problem. 


3.9 Helmholtz Equation with Finite Elements 


Finally, we have examined the Helmholtz equation 

\[ u_{xx} + u_{yy} + \beta u = f \]

with $\beta = 200$ and Dirichlet boundary conditions, discretized via bilinear finite elements. The grid
for those elements is not uniform; in each space direction, the domain is divided into 4 elements, and
the grid points are the roots of the Legendre polynomial of degree 17 in each element. Hence the
total number of grid points is $63^2$. This grid induces a division of the domain into squares, which
serve as the bilinear finite elements. The matrix operator for this bilinear element scheme may be 
used as a preconditioner for a spectral element discretization of the Helmholtz equation (see [31] and 
the references therein). Though the coefficient matrix has nine non-zero diagonals, the operators 
for coarser grids have five non-zero diagonals only; they are obtained from the above finite difference 
approximation in the automatic or classical manners. Actually, this is a double discretization strategy. 
The relaxations on the finest grid are the ILU iteration or the four-color Gauss-Seidel iteration. On
the second grid, the relaxation is ILU or RB iteration. One pre-smoothing and one post-smoothing
are performed on those two grids. On coarser grids, since the operators are more indefinite, these
relaxation methods diverge; hence we use the Kaczmarz iteration instead, with 40 pre-smoothings
and 40 post-smoothings on each level. Since on the third grid the number of points is 1/16 of that of
the original grid, the total work on that grid is about five Jacobi relaxations of the original system.
The cost of the whole multigrid or MBF procedure is about 10 such relaxations. RRE acceleration
is restarted after every 10 multigrid or MBF iterations. The number of MBF iterations needed to
reduce the residual by 6 orders of magnitude is 28 when ILU is used on the two finest grids and 27
when the Gauss-Seidel smoother is used there. For MGF (with ILU on the two finest grids and with
the continuation strategy of Section 3.7), the number of iterations needed is 52. When the residual is
reduced by 6 orders of magnitude, the error is reduced by 6 orders for MBF and 5 orders for MGF.

We also examined the case of mixed boundary conditions. For the mixed boundary conditions (5) on
the edges x = 0 and y = 0, with $\beta = 200$, $\alpha = 10i$ and N = 64, MBF converged in 52 iterations, each
costing about as much as 7 Jacobi iterations, with RRE restarted after every 20 iterations. Classical
MG methods did not converge for this problem, even with the continuation strategy of Section 3.7.

3.10 Helmholtz Equation with Spectral Elements

The above double discretization strategy is not limited to bilinear element schemes; it was employed
successfully for the spectral element scheme of [33] as well. Again, we relax the original
equation on the finest grid; then we compute the difference scheme based on the nodes of the spectral
elements, and use it to generate the coarse grid operators via MBF. These operators are used to
find the coarse grid correction. For the Helmholtz equation with the mixed boundary conditions (5)
on the edges x = 0 and y = 0, with the parameters $\beta = 100$, $\alpha = 100i$ and N = 16 and with
4 x 4 Legendre-type spectral elements, MBF converged in 16 iterations, each costing about as much
as 5 Jacobi iterations (with RRE restarted after 10 iterations). For the same problem but with the
parameters $\beta = 200$, $\alpha = 10i$ and N = 64, MBF converged in 60 iterations, each costing about as
much as 2 Jacobi iterations. This rate of convergence was much better than that of the algorithm of
[31], in which the spectral element scheme is preconditioned by a finite element or a finite difference
scheme (even when the preconditioning system is solved by MBF). As a matter of fact, even for the
Poisson equation our strategy was about 3 times faster than that of [31].


4 DISCUSSION 

The MBF method is a version of multigrid which is automatic in the sense that it depends
on the algebraic system of equations rather than on the original PDE. In effect, it is a "black box"
method for the solution of the linear system of equations. Hence it seems to be more robust than the
classical multigrid method. For instance, nonsymmetric terms in the equation do not slow down the
convergence, whether the characteristics are closed or open. Non-uniform grids are handled with the
same efficiency, and no special treatment of the neighborhood of the boundary is needed. Moreover,
when the RRE acceleration is applied to the method, it copes with the indefinite Helmholtz equation
as well. For all these examples the rate of convergence is largely independent of the size of the problem.
For anisotropic or pure advection problems, however, the rate of convergence of the MBF method
applied with the RRE acceleration depends slightly on the size of the problem.

The MBF method is especially suited to use with the ILU smoother. The red-black smoother
gives slightly worse results. The doubled damped Jacobi iteration as a smoother (with a damping
parameter 0.5) was examined too. For all the above problems except the discontinuous-anisotropic and
the hyperbolic problems, it was about half as fast as the ILU smoother. For
those two problems, the damped Jacobi smoother was unsatisfactory.

The versions of multigrid denoted MGF and MGN perform well for problems which do not involve
central first derivatives (including discontinuous and anisotropic problems). For problems which do
contain central first derivatives, since the algorithm is assumed automatic, the discretization on coarse
grids is of the same type as that of the fine grid, i.e., central. Hence divergence is often caused by
the coarse-grid corrections. This difficulty can be handled by the special treatments of Section 3.5,
but then the algorithm is no longer automatic. For the Helmholtz equation, one may overcome this
difficulty by using a continuation strategy in the MGN version. Even though this (non-automatic)
strategy is a bit slower than MBF, it is more stable and is applicable to more singular problems. If
one uses Kaczmarz relaxation on coarse grids, both MBF and MGF converge even for very indefinite
problems, with MBF again being faster.

As opposed to the classical multigrid versions, the MBF is applicable whether the number of grid
points in each space direction is even or odd. This indicates that it is applicable to problems defined
on general regions. Given a region $\Omega \subset R^2$, one takes as a fine grid the restriction of an infinite
2-dimensional fine grid to $\Omega$. For a coarser grid, one takes every other point (in both x and y space
directions) in the infinite fine grid, and takes the restriction to $\Omega$. The other coarse grids are created
in the same way. As we have seen, MBF is not affected by the possibility that some coarse grid
points lie near $\partial\Omega$. The coarse grid operators are created automatically as in the above description;
this can be done easily by modifying the block sizes in the coefficient matrix of the system. Thus the
algorithm is easy to program.

SUMMARY


In this article, we discuss a non-variational V-cycle multigrid algorithm based on the
cell-centered finite difference scheme for solving a second-order elliptic problem with discontinuous
coefficients. Due to the poor approximation property of piecewise constant spaces and the
non-variational nature of our scheme, one step of symmetric linear smoothing in our V-cycle
multigrid scheme may fail to be a contraction. Again, because of the simple structure of the
piecewise constant spaces, prolongation and restriction are trivial; we save significant computational
time with very promising computational results.


INTRODUCTION 


In the simulation of incompressible fluid flow in porous media, we have to solve at least one
second-order elliptic equation at each time step. A very important quantity is the Darcy velocity,
defined by

$u = -\mathcal{K}\nabla p$, (1)

where p is the pressure of the fluid and $\mathcal{K}$ is the conductivity. $\mathcal{K}$ can be written as $\mathcal{K} = \kappa/\mu$, where $\kappa$
is a tensor representing the permeability of the medium, which can be discontinuous in general, and
$\mu$ represents the viscosity of the fluid. $\mu$ is a continuous function of both time and space variables,
but may have a very sharp frontal change of values. In other words, $\mu$ can change rapidly inside the
domain of interest and the region of rapid change may move as time changes. According to the
conservation law of mass balance, the Darcy velocity u must be continuous along the normal
direction at an element or domain boundary, no matter whether $\mathcal{K}$ is discontinuous or not.

Now, we consider the following simple second-order elliptic equation in mixed form. Find a pair
(p, u) such that

$u = -\mathcal{K}\nabla p$, in $\Omega = (0,1)^2 \subset R^2$,
$\nabla\cdot u = f$, in $\Omega$, (2)
$p = 0$, on $\partial\Omega$,

where the conductivity $\mathcal{K}(x, y) = \mathrm{diag}(a, b)$ is positive and uniformly bounded above and below.

*This work was supported in part by the Department of Energy under Contract No. DE-FG05-92ER25143. The
authors would also like to thank Joe Pasciak for valuable discussions about this work.

Because of the discontinuity of $\mathcal{K}$, the classical solution p of (2) may not exist. Let $(\cdot,\cdot)$
denote the $L^2(\Omega)$ or $(L^2(\Omega))^2$ inner product and $H(\mathrm{div};\Omega) = \{u \in (L^2(\Omega))^2 \mid \nabla\cdot u \in L^2(\Omega)\}$. We seek
the solution pair $(p, u) \in H^1(\Omega) \times H(\mathrm{div};\Omega)$ such that

$(\mathcal{K}^{-1}u, v) = (p, \nabla\cdot v)$, $\forall v \in H(\mathrm{div};\Omega)$,
$(\nabla\cdot u, w) = (f, w)$, $\forall w \in L^2(\Omega)$. (3)

In [5], error estimates for solving (3) by the cell-centered finite difference scheme are studied,
with the following results:

$\|P - \mathcal{P}p\|_{L^2} + \|U - \Pi u\|_{L^2} \le C h^s \|p\|_{1+s,\Omega\setminus\Gamma}$, s = 1, 2, (4)

where $\mathcal{P} \times \Pi$ is the Raviart-Thomas projection, $\Gamma$ are the lines of discontinuity, which coincide with
the grid lines, and (P, U) is the numerical solution of the cell-centered finite difference scheme
approximating (3) [5]. In fact, we view the cell-centered finite difference method as a special
numerical integration of the Raviart-Thomas mixed finite element method [4-6]. For s = 2, (4) is
the superconvergence error estimate.

From the point of view of mass balance and accuracy, the cell-centered finite difference scheme is
one of the best numerical schemes to fulfill our goal. In this article, we investigate the efficiency of 
the multigrid algorithm based on the cell-centered finite difference scheme introduced in [5]. 

NUMERICAL SCHEME IN MULTIGRID SETTING 

Let us use the Laplacian operator, $-\Delta$, to explain the cell-centered finite difference scheme
stencil. For an interior node, the stencil for $-\Delta$ is (a) in Figure 1. For a corner node, the stencil for
$-\Delta$ is (b) in Figure 1. For other boundary nodes, the stencil for $-\Delta$ is (c) in Figure 1. For
discontinuous conductivity, see [5] for details. Now, we consider the uniform grid only. Let $\mathcal{M}_k$
denote the piecewise constant Raviart-Thomas rectangular pressure space defined on $\Omega$ with mesh
size $h_k = 2^{-(k+1)}$, k = 0, 1, 2, 3, ..., J. It is clear that

$\mathcal{M}_0 \subset \mathcal{M}_1 \subset \mathcal{M}_2 \subset \dots \subset \mathcal{M}_{J-1} \subset \mathcal{M}_J \subset L^2(\Omega)$. (5)

With an abuse of notation, for $u \in \mathcal{M}_k$, u is either a piecewise constant function or a vector with its nodal
values as its entries. On $\mathcal{M}_k$, the cell-centered finite difference approximation is to find $P \in \mathcal{M}_k$
such that

$A_k P = F_k = \mathcal{P}_k f$, k = 0, 1, 2, ..., J. (6)

Here $\mathcal{P}_k : L^2 \to \mathcal{M}_k$ is the $L^2$-projection into $\mathcal{M}_k$ defined by

$(f, w) = (\mathcal{P}_k f, w)$, $\forall w \in \mathcal{M}_k$, (7)

and f is the load function of (2). The corresponding stencil of $A_k$ is shown in Figure 1. Our goal is
to find $P \in \mathcal{M}_J$ such that

$A_J P = F_J = \mathcal{P}_J f$. (8)

Figure 1. Stencils for the Laplacian operator: (a) interior node, (b) corner node, (c) other boundary nodes.


The discrete $L^2$-inner product and associated norm on $\mathcal{M}_k$ are denoted by

$(u, v)_k = h_k^2\, v^T u$ and $\|u\|_k^2 = (u, u)_k$, $u, v \in \mathcal{M}_k$, (9)

where $v^T u$ is the usual algebraic inner product. Let $A_J$ define an associated bilinear form $\mathcal{A}_J(\cdot,\cdot)$
on $\mathcal{M}_J$ by

$(A_J w, \phi)_J = \mathcal{A}_J(w, \phi)$. (10)

Before we define $A_k$ for $0 \le k < J$, we first define the prolongation operator $I_k$ and the
restriction operator $P^0_{k-1}$. Let $I_k : \mathcal{M}_{k-1} \to \mathcal{M}_k$, k = 1, 2, ..., J, be the natural imbedding from
$\mathcal{M}_{k-1}$ to $\mathcal{M}_k$. Thus $P^0_{k-1} : \mathcal{M}_k \to \mathcal{M}_{k-1}$, the adjoint of $I_k$ in $(\cdot,\cdot)_k$, is defined by

$(P^0_{k-1} w, \phi)_{k-1} = (w, I_k\phi)_k$, $w \in \mathcal{M}_k$, $\phi \in \mathcal{M}_{k-1}$. (11)

From (9) and $h_{k-1} = 2h_k$, it is clear that $P^0_{k-1} = \frac{1}{4} I_k^T$ in matrix form. Now, we define the bilinear
form $\mathcal{A}_{k-1}(\cdot,\cdot)$ and the matrix $A_{k-1}$ on $\mathcal{M}_{k-1}$ for k = J, J-1, ..., 2, 1, by

$2\mathcal{A}_{k-1}(u, v) = \mathcal{A}_k(I_k u, I_k v)$, $\forall u, v \in \mathcal{M}_{k-1}$, (12)

and the corresponding matrix relation is

$A_{k-1} = \frac{1}{8} I_k^T A_k I_k = \frac{1}{2} P^0_{k-1} A_k I_k$. (12')

Remark 1. It is shown in [5] that for a piecewise smooth conductivity tensor $\mathcal{K}$, as long as the
discontinuities coincide with the coarser grid lines,

$A_{k-1} = (1 + O(h_k^2))\,\tilde{A}_{k-1}$. (13)

In (13), $\tilde{A}_{k-1}$ denotes the matrix obtained from (6) directly, and $O(h_k^2) = Ch_k^2$, where C depends on the
local smoothness of $\mathcal{K}$ but is independent of the jumps.
Since $I_k$ is a simple operator, it is much easier to generate $A_{k-1}$ by (12') than by (6) directly. Of
course, $A_k$, k = 0, 1, 2, ..., J-1, are all positive definite since $A_J$ is, and the spaces are nested.
Because of (12), our multigrid algorithm can be considered as a black box solver once $I_k$ has been
defined. We mention that (12) holds for three-dimensional problems of $-\nabla\cdot(\mathcal{K}\nabla u)$, with (12')
being changed to

$A_{k-1} = \frac{1}{16} I_k^T A_k I_k = \frac{1}{2} P^0_{k-1} A_k I_k$.

We also define the adjoint of $I_k$ in $\mathcal{A}_k(\cdot,\cdot)$, $P_{k-1} : \mathcal{M}_k \to \mathcal{M}_{k-1}$, by

$\mathcal{A}_{k-1}(P_{k-1}u, v) = \mathcal{A}_k(u, I_k v)$, $u \in \mathcal{M}_k$, $v \in \mathcal{M}_{k-1}$. (14)
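
The matrix relation (12') can be checked numerically for the Poisson case. The sketch below assumes a particular boundary treatment that is not confirmed by the source (homogeneous Dirichlet data imposed at the boundary cell faces, which gives a 1-D diagonal entry of 3 next to the boundary), lexicographic ordering, and hypothetical helper names `cc_laplacian` and `prolongation`.

```python
import numpy as np

def cc_laplacian(n):
    """Cell-centered 5-point matrix for -Laplace on an n x n grid of the unit square.

    Assumption: p = 0 is imposed at the boundary faces, so the 1-D operator has
    diagonal 3 in the first and last cell and 2 elsewhere.
    """
    h = 1.0 / n
    T = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    T[0, 0] = T[-1, -1] = 3.0
    I = np.eye(n)
    return (np.kron(T, I) + np.kron(I, T)) / h**2

def prolongation(n_coarse):
    """Natural imbedding I_k: each coarse cell value is copied to its 4 fine cells."""
    p1 = np.kron(np.eye(n_coarse), np.ones((2, 1)))   # 1-D piecewise-constant injection
    return np.kron(p1, p1)

n_fine = 8
A_fine = cc_laplacian(n_fine)
I_k = prolongation(n_fine // 2)
P0 = 0.25 * I_k.T                       # restriction, adjoint of I_k in (.,.)_k
A_coarse = 0.5 * P0 @ A_fine @ I_k      # relation (12'), equal to I_k^T A_k I_k / 8
max_diff = np.abs(A_coarse - cc_laplacian(n_fine // 2)).max()
```

Under these assumptions the matrix generated by (12') coincides with the matrix obtained by discretizing directly on the coarse grid, which is the constant-coefficient case of Remark 1.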

To define the smoothing process, we require linear operators $R_k : \mathcal{M}_k \to \mathcal{M}_k$ for k = 1, 2, ..., J.
These operators may or may not be symmetric with respect to the inner product $(\cdot,\cdot)_k$. Let
$A_k = D_k + L_k + L_k^T$, where $D_k$ is the diagonal part of $A_k$ and $L_k$ is the lower triangular part of $A_k$. The
linear smoothers we have tried are the following relaxation schemes. For $0 < \omega < 2$,

(a) Gauss-Seidel: Rk 

(b) Jacobi: Rk 

(c) Richardson: Rk 

where I is the identity operator on $\mathcal{M}_k$ and $\lambda_k$ is the spectral radius of $A_k$. We allow the relaxation
parameter $\omega$ to be different for pre-smoothing and post-smoothing processes in the following
definition.

Following [1], the multigrid operator $B_k : \mathcal{M}_k \to \mathcal{M}_k$ is defined by induction and is given as
follows. The pre-smoother is denoted by $R_k$ and the post-smoother by $\hat{R}_k$.

V-Cycle Multigrid Algorithm:

Set $B_0 = A_0^{-1}$. Assume that $B_{k-1}$ has been defined and define $B_k g$ for $g \in \mathcal{M}_k$ as follows:

1. Set $x^0 = 0$.

2. Define $x^l$ for l = 1, 2, ..., m(k) by $x^l = x^{l-1} + R_k(g - A_k x^{l-1})$.

3. Set $y^0 = x^{m(k)} + I_k B_{k-1} P^0_{k-1}(g - A_k x^{m(k)})$.

4. Define $y^l$ for l = 1, 2, ..., m(k) by $y^l = y^{l-1} + \hat{R}_k(g - A_k y^{l-1})$.

5. Set $B_k g = y^{m(k)}$.
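
Steps 1-5 can be sketched in a few lines. This is an illustrative dense-matrix version for the Poisson case only, not the authors' code: it assumes the boundary treatment stated in the comments, builds the coarse matrices from the relation $A_{k-1} = \frac{1}{8} I_k^T A_k I_k$, and uses a Gauss-Seidel pre-smoother $(D_k + L_k)^{-1}$ with its transpose as post-smoother. All function names are hypothetical.

```python
import numpy as np

def cc_laplacian(n):
    """Cell-centered -Laplace on an n x n grid; p = 0 assumed at the boundary faces."""
    h = 1.0 / n
    T = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    T[0, 0] = T[-1, -1] = 3.0
    I = np.eye(n)
    return (np.kron(T, I) + np.kron(I, T)) / h**2

def prolongation(n_coarse):
    """Natural imbedding: copy each coarse cell value to its 4 fine cells."""
    p1 = np.kron(np.eye(n_coarse), np.ones((2, 1)))
    return np.kron(p1, p1)

def build_hierarchy(J):
    """Levels k = 0..J with h_k = 2^-(k+1); coarse matrices from A_{k-1} = I^T A_k I / 8."""
    n = 2 ** (J + 1)
    A, I = {J: cc_laplacian(n)}, {}
    for k in range(J, 0, -1):
        n //= 2
        I[k] = prolongation(n)
        A[k - 1] = I[k].T @ A[k] @ I[k] / 8.0
    return A, I

def vcycle(k, g, A, I, m=1):
    """Return B_k g following steps 1-5 of the algorithm."""
    if k == 0:
        return np.linalg.solve(A[0], g)                  # B_0 = A_0^{-1}
    lower = np.tril(A[k])                                # D_k + L_k
    x = np.zeros_like(g)                                 # step 1
    for _ in range(m):                                   # step 2: pre-smoothing
        x = x + np.linalg.solve(lower, g - A[k] @ x)
    r_coarse = 0.25 * (I[k].T @ (g - A[k] @ x))          # P^0_{k-1} = I_k^T / 4
    y = x + I[k] @ vcycle(k - 1, r_coarse, A, I, m)      # step 3
    for _ in range(m):                                   # step 4: post-smoothing
        y = y + np.linalg.solve(lower.T, g - A[k] @ y)
    return y                                             # step 5

J = 3                                                    # 16 x 16 fine grid
A, I = build_hierarchy(J)
f = np.random.default_rng(0).standard_normal(A[J].shape[0])
x = np.zeros_like(f)
res = [np.linalg.norm(f)]
for _ in range(12):                                      # x <- x + B_J (f - A_J x)
    x = x + vcycle(J, f - A[J] @ x, A, I, m=1)
    res.append(np.linalg.norm(f - A[J] @ x))
```

The list `res` records the residual norm after each V-cycle, so ratios of successive entries give the contraction numbers discussed in the computational experiments.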

Remark 2. Since equation (12) holds for all levels, this multigrid algorithm is non-variational
according to [1], but the approximation property (4) is valid for each level as long as the
non-variational relation (12) is satisfied. In this algorithm m(k) is a positive integer which may
vary from level to level. In general this multigrid algorithm is not symmetric in $(\cdot,\cdot)_k$, except for
$\hat{R}_k = R_k^T$.

Setting $K_k = I - R_k A_k$ and $\hat{K}_k = I - \hat{R}_k A_k$, it is straightforward to check that

$I - B_k A_k = \hat{K}_k^{m(k)} (I - I_k B_{k-1} P^0_{k-1} A_k) K_k^{m(k)}$
$= \hat{K}_k^{m(k)} [I - I_k B_{k-1} A_{k-1} P_{k-1}] K_k^{m(k)}$. (16)

Equation (16) gives a fundamental recurrence relation for the multigrid operator $B_k$.

COMPUTATIONAL EXPERIMENTS 

We have tested the multigrid algorithm described in Section 2. We use a power method to 
compute the largest and the smallest eigenvalues of BjAj. 

The linear smoothers we have tried are the following. Let m be a positive integer and m(k) = m for
all k.

$S_1(m)$: $R_k = \frac{1}{\lambda_k} I$, $\hat{R}_k = R_k$,

$S_2(m)$: $R_k = \frac{1}{\lambda_k} I$, $\hat{R}_k = 2R_k$, where $\lambda_k$ is the largest eigenvalue of $A_k$,

$S_3(m)$: $R_k = (D_k + L_k)^{-1}$, $\hat{R}_k = R_k^T$,

^(m): = 1.35-Dfc 1 , Rk=-Rk, 


S 5 (m) : 

Rk = 


Rk = 

-Cs + X) 

&(">) : 

Rk = 

0 +i *f ’ 

Rk = 

- (2 Dk + L T k )-\ 


Note that only $S_1(m)$ and $S_3(m)$ make $B_J A_J$ symmetric with respect to $\mathcal{A}_J(\cdot,\cdot)$. The rest are neither symmetric
nor $\mathcal{A}_J(\cdot,\cdot)$-symmetric. We have also tried nonlinear smoothers: conjugate gradient and diagonally
preconditioned conjugate gradient algorithms. We shall use N(m) to represent our nonlinear
multigrid with diagonally preconditioned conjugate gradient smoothers. The choice of
different relaxation numbers comes from the suggestion of [3] for an algebraic multigrid algorithm, and
from our computational experiments.

We list our test results in Tables 1—9 at the end of this paper for the following problems. 

Ex. 1. Poisson problem: $\mathcal{K} = 1$ in (2).

Ex. 2. Isotropic problem with nearly singular piecewise smooth conductivity:

$\mathcal{K} = [0.001 + 11.1(1 + \cos(3.561\pi x)\sin(3.001\pi y))]\,q$,

$q = 10^{-4}$ if $x > 1/2$ and $y > 1/2$, and $q = 1$ otherwise.

Ex. 3. Same kind of problem as Ex. 2:

$\mathcal{K} = [0.001 + 45.1(1 + \cos(9.431\pi x)\sin(3.001\pi y))]\,q$,

$q = 10^{4}$ if $x > 1/2$ and $y > 1/2$, and $q = 1$ otherwise.

Ex. 4. Anisotropic problem with smooth conductivity:

$\mathcal{K} = \mathrm{diag}(a, b)$,

$a = 0.001 + 45.1(1 + \cos(9.431\pi x)\sin(9.431\pi y))$,
$b = 0.001 + 45.1(1 + \sin(9.431\pi x)\cos(9.431\pi y))$.


Note that all the solutions of our examples have the superconvergence results proved in [5], i.e.,
they satisfy (4) with s = 2.


In Tables 1 and 2, for example, the second row of Table 1 means J + 1 = 3 level multigrid with
$h_J = 1/8$; "$\lambda_m, S_1(1)$" means $\lambda_m = \min\lambda(B_J A_J)$ with $S_1(1)$ smoothers, and "$\Lambda_m, S_1(1)$" means
$\Lambda_m = \max\lambda(B_J A_J)$ with $S_1(1)$ smoothers. From Table 1, we can see that even when $I - B_J A_J$ fails
to be a reducer, $B_J$ may still be a good preconditioner. In Tables 5-7, it is interesting to see the
relations between the number of V-cycles (#V), the average contraction number (avc) and the time spent on
the machine (cpu, in seconds) when solving a fixed problem on a fixed grid using different
numbers of levels. In Tables 3-5 and 7-9, avc is defined by

$\mathrm{avc} = \frac{1}{n}\sum_{j=1}^{n} \frac{\|r_j\|_J}{\|r_{j-1}\|_J}$,

where n = #V is the total number of V-cycles and $\|r_j\|_J$ is the discrete $L^2$-norm of the residual
after the jth V-cycle. The stopping tolerance for all the iterative algorithms is $\|r_n\|_J \le \epsilon = 10^{-14}$. Our
coarsest grid solver is a diagonally preconditioned conjugate gradient solver with tolerance $\epsilon_0 = 10^{-19}$.
In Tables 7-9, "cg" means the standard conjugate gradient algorithm, for which "#V"
means the total number of iteration steps needed to reach $\|r_n\|_J \le \epsilon = 10^{-14}$, and "bpcg" means the incomplete
factorization preconditioned conjugate gradient algorithm [2].

Table 1. For Ex. 1

Grid     J    λ_m, S_1(1)    Λ_m, S_1(1)    λ_m, S_1(2)    Λ_m, S_1(2)
4^2      1    0.548          1.351          0.788          1.134
8^2      2    0.446          1.804          0.704          1.297
16^2     3    0.397          2.394          0.663          1.470
32^2     4    0.367          3.128          0.639          1.633
64^2     5    0.345          4.023          0.623          1.783
128^2    6    0.325          5.106          0.609          1.924
256^2    7    0.299          6.417          0.592          2.059


Table 2. For Ex. 1

Grid     J    λ_m, S_3(1)    Λ_m, S_3(1)    λ_m, S_3(2)    Λ_m, S_3(2)
4^2      1    0.858          1.142          0.971          1.037
8^2      2    0.812          1.239          0.960          1.062
16^2     3    0.794          1.344          0.954          1.089
32^2     4    0.785          1.445          0.951          1.112
64^2     5    0.784          1.535          0.950          1.131
128^2    6    0.784          1.614          0.949          1.146
256^2    7    0.783          1.685          0.949          1.159

Table 3. For Ex. 1 by $B_J(S_2(1))$

Table 4. For Ex. 1 by $B_J(S_3(1))$

Table 8. For Ex. 3


Grid             m 2 )     S_3(2)

         #V      12        11
         avc     0.028     0.016
         cpu     2.3       1.0

         #V      14        12
         avc     0.034     0.020
         cpu     9.0       1.8

         #V      14        13
         avc     0.035     0.023
         cpu     36.5      11.5


17.5 33.5 


Table 9. For Ex. 4 


Grid     N(3)      S_6(2)

         13        18
         0.012     0.042
         32.0      0.5

         16        19
         0.031     0.048
         13.0      2.5

         21        25
         0.07      0.091
         68.0      18.0


1.2 x 10 1 


9.1 x 10 10 


7.2 x 10 11 





















































A SEMI-LAGRANGIAN APPROACH TO 
THE SHALLOW WATER EQUATIONS 


J. R. Bates 

National Aeronautics and Space Administration, 
Goddard Laboratory for Atmospheres, 
Greenbelt, MD, 20771 

Stephen F. McCormick and John Ruge 
Computational Math Group, Campus Box 170, 
University of Colorado at Denver, 

PO Box 173363, Denver, CO, 80217-3364 

David S. Sholl 

Program in Applied Mathematics, Campus Box 526, 
University of Colorado at Boulder 
Boulder, CO, 80309-0526 

Irad Yavneh 

Oceanography and GTP, 

National Center for Atmospheric Research, 
Boulder, CO, 80307-3000 


Abstract 

We present a formulation of the shallow water equations that emphasizes
the conservation of potential vorticity. A locally conservative
semi-Lagrangian time-stepping scheme is developed, which leads to a
system of three coupled PDEs to be solved at each time level. We
describe a smoothing analysis of these equations, on which an effective
multigrid solver is constructed. Some results from applying this solver
to the static version of these equations are presented.


1 Formulation of the Shallow Water Equations


The shallow water equations provide a two-dimensional prototype of the 
equations needed for three-dimensional simulations of atmospheric motions 
[1] [2]. They are useful for testing the viability of new numerical schemes 
for atmospheric simulation because they share many of the properties with, 
but lack the full complexity of, a full three-dimensional system. The shallow 
water equations can be written as 


$\frac{du}{dt} = -\phi_x + fv$, (1)

$\frac{dv}{dt} = -\phi_y - fu$, (2)

$\frac{d\phi}{dt} = -\phi D$, (3)

where u and v are the velocity components of the wind, $D = u_x + v_y$ is
the divergence of the velocity, f is the Coriolis parameter, and $\phi$ is the
geopotential height, assumed to be a positive function. The derivatives are
material derivatives, that is,

$\frac{d}{dt} = u\frac{\partial}{\partial x} + v\frac{\partial}{\partial y} + \frac{\partial}{\partial t}$. (4)


A considerable amount of effort has gone into designing numerical methods 
that will solve these equations (see for example the references cited in [1]). 
The purpose of this paper is to study a multigrid scheme applied to a form 
of these equations that is of special physical interest. 

There are many possible formulations of the shallow water equations. 
We will derive a different formulation from the one above that has certain 
physical and numerical advantages. To this end, we define vorticity by 


$\zeta = v_x - u_y$. (5)

Then, subtracting the y-derivative of Eq. 1 from the x-derivative of Eq. 2
gives

$\frac{d}{dt}(\zeta + f) = -(\zeta + f)D$. (6)

Solving for D in Eq. 3 and substituting it into Eq. 6 yields

$\frac{d}{dt}\left(\frac{\zeta + f}{\phi}\right) = 0$. (7)

Eq. 7 is important in practice because it clearly asserts that the physical
quantity $(\zeta + f)/\phi$, called potential vorticity, is conserved in time along any
Lagrangian trajectory.
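
The substitution that leads to Eq. 7 can be written out in one step. The lines below are a sketch that assumes only Eq. 3 in the form $d\phi/dt = -\phi D$ and Eq. 6 as stated, with $\zeta$ the vorticity of Eq. 5:

```latex
\begin{aligned}
\frac{d}{dt}\left(\frac{\zeta+f}{\phi}\right)
  &= \frac{1}{\phi}\,\frac{d(\zeta+f)}{dt}
     - \frac{\zeta+f}{\phi^{2}}\,\frac{d\phi}{dt} \\
  &= \frac{1}{\phi}\bigl[-(\zeta+f)D\bigr]
     - \frac{\zeta+f}{\phi^{2}}\bigl[-\phi D\bigr] = 0 .
\end{aligned}
```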

Now adding the x-derivative of Eq. 1 and the y-derivative of Eq. 2 gives

$\frac{dD}{dt} = -\nabla^2\phi - \nabla\cdot(f\mathbf{k}\times\mathbf{V}) - N$, (8)

where $\mathbf{k} = (0, 0, 1)$, $\mathbf{V} = (u, v, 0)$, and $N = (u_x)^2 + (v_y)^2 + 2v_x u_y$. It is not
hard to see that Eqs. 3, 7, and 8 are equivalent to the original formulation
of the shallow water equations (Eqs. 1-3), but they are not yet in the form
we wish to consider.

From the point of view of a multigrid solver, we will see that it is convenient
to rewrite these equations in terms of the geopotential, $\phi$, the stream
function, $\psi$, and the velocity potential, $\chi$. The latter two variables satisfy

$\mathbf{V} = \mathbf{k}\times\nabla\psi + \nabla\chi$, (9)

$\zeta = \nabla^2\psi$, (10)

$D = \nabla^2\chi$. (11)

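Using these variables, we arrive at the form of the shallow water equations
used in this paper:

$\frac{d}{dt}\left(\frac{\nabla^2\psi + f}{\phi}\right) = 0$, (12)

$\frac{d\nabla^2\chi}{dt} = -\nabla^2\phi - \nabla\cdot(f\mathbf{k}\times\nabla\chi - f\nabla\psi) - N$, (13)

$\frac{d\phi}{dt} = -\phi\nabla^2\chi$, (14)

where $N = (\psi_{xy} - \chi_{xx})^2 + (\psi_{xy} + \chi_{yy})^2 - 2(\psi_{yy} - \chi_{xy})(\psi_{xx} + \chi_{xy})$.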

These equations have several attractive properties. As already noted, they
emphasize the conservation of potential vorticity along Lagrangian trajectories.
Furthermore, we shall show in Section 2 that, when a semi-Lagrangian
approach is taken for the time derivatives, all of the variables appear in
potential form in the resulting equations. This means that a simple vertex
centered grid is sufficient to discretize the problem spatially; a staggered grid
is not needed. This fact should be particularly useful when the problem is
posed on a spherical domain. Finally, we shall see in Section 3 that these
equations are well suited for multigrid solution.

An ideal domain for simulating atmospheric motions is a sphere. However, 
a spherical coordinate system introduces many difficulties that may confuse 
the task of developing an efficient solver for the equations at hand. Thus, as 
a first step in determining the feasibility of applying multigrid methods to 
our formulation of the shallow water equations, we have chosen to solve the 
system on a cylindrical domain. Specifically, we consider a domain that is
periodic in the x direction with length d and includes y in the range [0, L].
We set $\chi = \psi = 0$ and $\phi = \phi_0$ at the y boundaries, where $\phi_0$ is a given
constant. We assume that the Coriolis parameter may be written as

$f = f_0 + \beta y$, (15)

with $f_0$ and $\beta$ constants. This model allows us to determine the effectiveness
of multigrid methods for these equations without the complications of
constructing a full three-dimensional global atmospheric model.


2 A Semi-Lagrangian Time Stepping Scheme 

Eqs. 12-14 are written in a Lagrangian reference frame in which the evolution
of the fluid is observed along the paths of imaginary fluid particles.
There are some obvious disadvantages of evolving a set of particles along
Lagrangian trajectories numerically. In particular, a grid that is initially
uniform will in general become very irregular, often leading to a degradation
of global accuracy. As a compromise, semi-Lagrangian methods have been
developed to produce numerical methods that preserve the advantages of
regular grids while simultaneously taking advantage of the Lagrangian form
of the equations. There is an extensive body of literature describing these
methods. In particular, [1] provides an excellent review of the application
of semi-Lagrangian methods to meteorological problems. This reference
describes in detail a semi-Lagrangian scheme for the integration of Eqs. 1-3.
The scheme we describe in this section is an adaptation of this scheme to
our reformulation of the shallow water equations, and the reader is urged to
consult [1] for more detail.

Figure 1: A schematic diagram showing the main quantities used in the
calculation of the departure points for the semi-Lagrangian time-stepping
scheme. The exact trajectory is represented by a solid line and the approximate
trajectory by a dashed line.

The fundamental idea of a semi-Lagrangian scheme is to impose a regular 
grid at the new time level, and to backtrack the fluid trajectories to the 
previous time level. At the old time level, the quantities that are needed 
are evaluated by interpolation from their known values on a regular grid. In 
general, as is the case in our problem, the velocity field at the new time step 
is unknown, so the critical problem in this idea is the computation of the 
trajectory departure points. 

A schematic representation of the quantities involved in computing the
departure points is shown in Fig. 1. The displacement between a grid point
on the new time level, $\mathbf{x}_m(t)$, and the departure point of the trajectory leading
to this point on the previous time level, $\mathbf{x}_m^*(t - \Delta t)$, is denoted by $\mathbf{a}_m$. If the
velocity field is considered to be constant from $t - \Delta t$ to t, then $\mathbf{a}_m$ satisfies
the equation

$\mathbf{a}_m = \Delta t\,\mathbf{V}(\mathbf{x}_m - \mathbf{a}_m/2,\ t - \Delta t/2)$. (16)

The velocity at time $t - \Delta t/2$ may be defined by extrapolation from the two
previous time levels by

$\mathbf{V}(\mathbf{x}, t - \Delta t/2) = \frac{3}{2}\mathbf{V}(\mathbf{x}, t - \Delta t) - \frac{1}{2}\mathbf{V}(\mathbf{x}, t - 2\Delta t) + O(\Delta t^2)$. (17)

Eqs. 16 and 17 give an implicit equation for $\mathbf{a}_m$ in terms of the known velocity
field at two previous time levels, and we may consider an iterative method
for determining the correct $\mathbf{a}_m$. Assuming that a suitable approximation is
made, $\mathbf{x}_m - \mathbf{a}_m/2$ would not generally lie on a grid point, so the velocities
at this point must be obtained by interpolation. It has been shown [4] [5] [6]
that for problems of this type it is sufficient to use linear interpolation to
define the quantities in Eq. 17. It is also known [7] that successive iteration
for the solution of Eq. 16 converges provided

$\Delta t < \frac{1}{\max[|u_x|, |u_y|, |v_x|, |v_y|]}$. (18)
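
The successive iteration for Eq. 16 can be sketched in one space dimension. This is an illustration under simplifying assumptions that are not in the source: a periodic 1-D grid instead of the two-dimensional domain, linear interpolation via `numpy.interp`, and a hypothetical function name `departure_displacements`.

```python
import numpy as np

def departure_displacements(x, u_prev, u_prev2, dt, length, iters=5):
    """Fixed-point iteration for a = dt * u(x - a/2, t - dt/2) on a periodic grid.

    x        grid points in [0, length), uniformly spaced
    u_prev   velocity at time t - dt on the grid
    u_prev2  velocity at time t - 2*dt on the grid
    """
    u_mid = 1.5 * u_prev - 0.5 * u_prev2       # extrapolation to t - dt/2 (Eq. 17)
    a = dt * u_mid                             # first guess: velocity at the arrival point
    for _ in range(iters):
        # linear interpolation of the mid-time velocity at the trajectory midpoint
        a = dt * np.interp(x - 0.5 * a, x, u_mid, period=length)
    return a

x = np.linspace(0.0, 1.0, 64, endpoint=False)
u = 1.0 + 0.5 * np.sin(2.0 * np.pi * x)        # steady test velocity, max |u_x| = pi
a = departure_displacements(x, u, u, dt=0.05, length=1.0, iters=20)
```

With this test velocity the time step 0.05 is well inside the bound of Eq. 18 (here $1/\pi$), so the map is a contraction and a handful of iterations is enough.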


Once the $\mathbf{a}_m$ are known, the departure point values of the variables in
our equations are defined as illustrated by


$\phi_m^*(t - \Delta t) = \phi(\mathbf{x}_m - \mathbf{a}_m,\ t - \Delta t)$. (19)


Again, these values must be interpolated from known values at the grid 
points. It has been found [4] [5] [6] that it is advantageous to do this using 
cubic interpolation. A material time derivative may then be discretized by 

$\frac{d\phi}{dt} = \frac{1}{\Delta t}[\phi_m(t) - \phi_m^*(t - \Delta t)]$, (20)

and nonderivative quantities can be represented by the simple average

$\phi = \frac{1}{2}[\phi_m(t) + \phi_m^*(t - \Delta t)]$. (21)

Using this discretization, our formulation of the shallow water equations
may be manipulated to show that the equations that determine the solution
of the system at a new time level are

$\nabla^2\psi^+ + f - f_1\phi^+ = 0$, (22)

$\nabla^2\chi^+ + \tau[\nabla^2\phi^+ + \beta\chi_x^+ - \beta\psi_y^+ - f\nabla^2\psi^+] = f_2$, (23)

$\phi^+[1 + \tau\nabla^2\chi^+] = f_3$, (24)

where

$f_1 = \frac{\nabla^2\psi^* + f^*}{\phi^*}$, (25)

$f_2 = \nabla\cdot[\mathbf{V}^* - \tau(f\mathbf{k}\times\mathbf{V}^* + \nabla\phi^*)]$, (26)

$f_3 = \phi^*(1 - \tau\nabla^2\chi^*)$, (27)

and $\tau = \Delta t/2$. The starred quantities are evaluated at the trajectory
departure points at the previous time level, and the superscript + refers to
quantities defined on a regular spatial grid at the new time level. We refer
to Eqs. 22-24 as the static equations. The superscript + will be omitted in
what follows.

The numerical algorithm needed to integrate our form of the shallow
water equations splits naturally into two pieces. The first task is to compute
the departure point quantities needed to define $f_1$, $f_2$, and $f_3$. This is done
in the manner outlined above, using information from two previous time
levels. The velocity field at any time level may be obtained from $\chi$ and $\psi$
using $u = -\psi_y + \chi_x$ and $v = \psi_x + \chi_y$. Once the departure point quantities
are known, the second task is to solve the static equations. As we shall
demonstrate below, it is possible to construct an efficient multigrid solver for
these equations. Note that nowhere in this method is it necessary to solve
Eqs. 10 and 11 for $\chi$ and $\psi$ in terms of u and v.


3 Coupling Analysis of the Static Equations 

The coupling between the equations in any system of equations plays a pivotal
role in the behavior of the system. In particular, when discretized systems
of PDEs are to be solved by multigrid, the coupling of the equations
must determine the character of the relaxation schemes that are to be applied.
Fortunately, a straightforward method for analyzing the coupling of
a system and its relation to constructing a multigrid solver is available [8].
In this section we apply this method to the static equations derived above.
Throughout the section, we use the definitions and notation of [8], to which
the reader is referred for an understanding of the technique we are about to
use.

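The linearized static equations are given in brief as follows:

$\begin{pmatrix} \nabla^2 & -f_1 & 0 \\ -f\tau\nabla^2 - \tau\beta\partial_y & \tau\nabla^2 & \nabla^2 + \tau\beta\partial_x \\ 0 & 1 + \tau\nabla^2\chi & \tau\phi\nabla^2 \end{pmatrix} \begin{pmatrix} \psi \\ \phi \\ \chi \end{pmatrix} = \begin{pmatrix} -f \\ f_2 \\ f_3 \end{pmatrix}$. (28)

In constructing this system, we have associated variables with equations in
the natural way; that is, $\psi$, $\phi$, and $\chi$ are associated with Eqs. 22, 23, and
24, respectively.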

The order array and weight array for this system are

$\begin{pmatrix} 2 & 0 & N \\ 2 & 2 & 2 \\ N & 0 & 2 \end{pmatrix}$ (29)

and

$Q = \begin{pmatrix} N & 2 & N \\ 0 & N & 0 \\ N & 2 & N \end{pmatrix}$, (30)

respectively.

To account for finite mesh size effects, we need the scaled coefficient array 



N 


-h N ] 

1 T- 1 

1+tV 2 x 1 
t4> J 


(31) 


The computation of these arrays is straightforward. The method of [8] is
almost automatic, and the arrays are included here explicitly only for completeness.
From these arrays, the coupling graph may be constructed, as
shown in Fig. 2.




Figure 2: The coupling graph for the static equations. The finite mesh- 
size coupling coefficients are f\ (1 — + 2), / (2 — ► 1), 1/t (2 — 3), and 
[l + rV 2 X ]/r^ (3-2). 


We may conclude immediately from the coupling graph that Eq. 22 is
weakly coupled to Eq. 23 when

$$ f_1 f h^2 \ll 1, \tag{32} $$

and that Eqs. 23 and 24 are weakly coupled when

$$ (1 + \tau\nabla^2\chi)\,\frac{h^2}{\tau\phi} \ll 1. \tag{33} $$

This implies that if both of these conditions are satisfied, then each equation
may be relaxed separately, as though the system were fully decoupled.

We now need to estimate the quantities in these coupling conditions using
a physically realistic solution of a slightly different version of the shallow
water equations. The equations we are dealing with assume that the surface
of the fluid is free. To fix the surface profile of the fluid (the so-called
'rigid-lid' condition), we set $\partial\phi/\partial t = 0$ in Eq. 1. It can then be shown
by direct substitution that the following is an exact form for the resulting
Rossby-Haurwitz wave solution:

$$ u = U - A l \cos ly \, \sin k(x - ct), \tag{34} $$




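  Parameter    Value
  $f$          $f_0 = 1 \times 10^{-4}$ s$^{-1}$, $\beta = 1.57 \times 10^{-11}$ m$^{-1}$ s$^{-1}$
  $\tau$       500 s
  $d$          $1 \times 10^{7}$ m
  $f_1$        $\sim 10^{-8}$
  $L$          $5 \times 10^{6}$ m
  $h$          $\sim 10^{-7}$
  $\phi_0$     $1 \times 10^{4}$ m
  $h$          $\sim 10^{4}$

Table 1: Some typical physical parameters for the shallow water equations.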

$$ v = A k \sin ly \, \cos k(x - ct), \tag{35} $$

$$
\phi = \phi_0 - f_0 U y - \tfrac{1}{2}\beta U y^2
     + A \sin k(x - ct)\,[\,f \sin ly - (c - U)\, l \cos ly\,]
     + \tfrac{1}{4} A^2 [\,l^2 \cos 2k(x - ct) + k^2 \cos 2ly\,], \tag{36}
$$

where $A$ and $U$ are constants, $c = U - \beta/(k^2 + l^2)$ is the Rossby-Haurwitz
phase speed, $k = 2\pi m/L$ for integer $m$, and $l = n\pi/d$ for integer $n$. Waves of
this type are the dominant feature of large scale weather motions. This
solution satisfies different boundary conditions from the problem we are
treating, but it is nevertheless useful for estimating the size of the
parameters in our system. It can be shown for this solution that

$$ \nabla^2 \chi = 0 \tag{37} $$

and

$$ \nabla^2 \psi = -A (k^2 + l^2) \sin ly \, \sin k(x - ct). \tag{38} $$

Some typical numerical values of the parameters in the coupling
conditions are shown in Table 1. A Rossby-Haurwitz wave with $n = m = 1$,
$A = 3 \times 10^{7}$ m$^2$ s$^{-1}$, and $U = 20$ m s$^{-1}$, together with standard physical
constants, was used to derive the data in this table. From these values it can
be seen that Eq. 32 is satisfied, but Eq. 33 is certainly violated on intermediate
and coarse grids. In terms of constructing a good smoother for the system,
this means that Eq. 22 can be relaxed as though it were decoupled from
the system, but the two remaining equations must be dealt with together, at
least on coarse grids. In practice, it is easiest to use the same smoother on
all grids to start with.

To deal with Eqs. 23 and 24 together, collective relaxation is used. For
linear equations, this means that, when the equations are relaxed at a point,
corrections are made to all the variables associated with the equations such
that the residuals of the equations become zero at that point. This may be
done by replacing $\phi$ and $\chi$ with $\phi + \delta\phi$ and $\chi + \delta\chi$, respectively, in Eqs. 23
and 24 at a single point and the differential operators with their discretized
counterparts and solving for the corrections $\delta\phi$ and $\delta\chi$. Because Eq. 24 is
nonlinear, a term proportional to $\delta\phi\,\delta\chi$ appears. We neglect this term and
solve the resulting linear system directly. This method is equivalent to taking
a single Newton step for these equations.
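
The collective relaxation step described above can be sketched in Python on a toy pair of pointwise equations. The coefficients `a`, `c`, `b` and the names `residuals` and `collective_step` are hypothetical stand-ins chosen for illustration; they are not the discretized Eqs. 23 and 24. Only the structure follows the text: linearize about the current values, drop the product of the two corrections, and solve the resulting 2 x 2 system directly.

```python
def residuals(phi, chi, a, c, b):
    """Residuals of a toy pointwise system (hypothetical, for illustration):
        a[0][0]*phi + a[0][1]*chi             = b[0]
        a[1][0]*phi + a[1][1]*chi + c*phi*chi = b[1]
    """
    r0 = b[0] - (a[0][0] * phi + a[0][1] * chi)
    r1 = b[1] - (a[1][0] * phi + a[1][1] * chi + c * phi * chi)
    return r0, r1


def collective_step(phi, chi, a, c, b):
    """One collective update of (phi, chi) at a single point."""
    r0, r1 = residuals(phi, chi, a, c, b)
    # Linearization about (phi, chi); the term c*dphi*dchi is neglected,
    # which makes this a single Newton step.
    j00, j01 = a[0][0], a[0][1]
    j10, j11 = a[1][0] + c * chi, a[1][1] + c * phi
    det = j00 * j11 - j01 * j10
    # Solve the 2 x 2 system for the corrections by Cramer's rule.
    dphi = (r0 * j11 - j01 * r1) / det
    dchi = (j00 * r1 - j10 * r0) / det
    return phi + dphi, chi + dchi
```

In an actual smoother such a step would be applied once per grid point per sweep, with the coefficients assembled from the neighboring values; repeating it at one point, as the check does, only demonstrates that the linearization is consistent.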


4 Preliminary Numerical Results 

A preliminary code has been implemented that applies the multigrid method 
just described to the static equations. Eq. 22 was relaxed by red-black 
Gauss-Seidel iteration, and Eqs. 23 and 24 were relaxed collectively as de- 
scribed above in a lexicographic ordering. The equations are nonlinear, so 
the Full Approximation Scheme (FAS) [9] was used for the coarse-to-fine cor- 
rections. Full weighting was used for the fine to coarse grid restrictions, and 
linear interpolation for the coarse to fine grid transfers. Note that the grid 
transfers are straightforward because all the variables are defined on the same 
vertex centered grid. The standard five-point discretization was used for the 
Laplacian operator. Similarly, other derivatives were discretized using the 
usual finite difference formulae. At the time of writing, a semi-Lagrangian 
time-stepping scheme had been implemented, but the two codes had not been 
fully combined. 
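
For readers unfamiliar with the smoother named above, the following is a minimal Python sketch of red-black Gauss-Seidel for the standard five-point Laplacian on a uniform vertex-centered grid. It assumes a model problem $\nabla^2 u = f$ with fixed boundary values; the function names and the nested-list storage are illustrative choices, not taken from the code described in the text.

```python
def red_black_sweep(u, f, h):
    """One red-black Gauss-Seidel sweep for lap(u) = f (five-point stencil).

    u and f are lists of rows on a vertex-centered grid; boundary values of u
    are held fixed. Red points ((i + j) even) are updated first, then black.
    """
    ny, nx = len(u), len(u[0])
    for colour in (0, 1):
        for j in range(1, ny - 1):
            for i in range(1, nx - 1):
                if (i + j) % 2 == colour:
                    u[j][i] = 0.25 * (u[j][i - 1] + u[j][i + 1]
                                      + u[j - 1][i] + u[j + 1][i]
                                      - h * h * f[j][i])
    return u


def residual_norm(u, f, h):
    """Discrete L2 norm of the residual f - lap(u) over interior points."""
    ny, nx = len(u), len(u[0])
    total = 0.0
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            lap = (u[j][i - 1] + u[j][i + 1] + u[j - 1][i] + u[j + 1][i]
                   - 4.0 * u[j][i]) / (h * h)
            total += (f[j][i] - lap) ** 2
    return (total / ((nx - 2) * (ny - 2))) ** 0.5
```

Because points of one colour depend only on points of the other colour, each half-sweep can be done in any order, which is one common reason for preferring this ordering.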

In order to test the convergence of the multigrid scheme, we set the
forcing functions in the static equations to a variety of functional forms. The
magnitude of these functions was indicated by the Rossby-Haurwitz wave
solution introduced in the previous section. When the problem was solved
on a 64 x 32 grid with a V(1,1) cycle, the convergence rates for the $L_2$ norm
of the residuals were 0.22, 0.25, and 0.27 for Eqs. 22-24, respectively. When
a V(2,1) cycle was used, the rates were 0.15, 0.13, and 0.14. In each of these
cases, a single relaxation sweep consisted of relaxing Eq. 22 once followed by
relaxing Eqs. 23 and 24 collectively once.
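
To put these rates in perspective, a short Python calculation gives the number of cycles needed for a given residual reduction if the quoted rate held on every cycle. The target reduction of $10^{-6}$ is an assumption made only for this illustration; the text does not state a stopping tolerance for these tests.

```python
import math


def cycles_needed(rate, reduction):
    """Smallest n such that rate**n <= reduction.

    Assumes a constant per-cycle convergence rate with 0 < rate < 1 and a
    target residual reduction with 0 < reduction < 1.
    """
    return math.ceil(math.log(reduction) / math.log(rate))
```

With the slowest V(1,1) rate above, `cycles_needed(0.27, 1e-6)` evaluates the same expression for Eq. 24; the call is cheap enough to tabulate for any rate of interest.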

These results suggest that multigrid may be an efficient way of solving
these equations. Clearly, there are many possible variants on the scheme
described above. For instance, the coupling analysis suggests that it may be
fruitful to relax Eqs. 23 and 24 independently on fine grids and switch to
collective relaxation only on coarser grids.
MULTIGRID SOLUTION OF THE NAVIER-STOKES EQUATIONS
ON HIGHLY STRETCHED GRIDS WITH DEFECT CORRECTION

SUMMARY

Relaxation-based multigrid solvers for the steady incompressible Navier-Stokes
equations are examined to determine their computational speed and robustness. Four
relaxation methods with a common discretization have been used as smoothers in a single
tailored multigrid procedure. The equations are discretized on a staggered grid with first
order upwind used for convection in the relaxation process on all grids and defect
correction to second order central on the fine grid introduced once per multigrid cycle. A
fixed W(1,1) cycle with full weighting of residuals is used in the FAS multigrid process.
The resulting solvers have been applied to three 2D flow problems, over a range of
Reynolds numbers, on both uniform and highly stretched grids. In all cases the $L_2$ norm
of the velocity changes is reduced to $10^{-6}$ in a few tens of fine grid sweeps. The results
from this study are used to draw conclusions on the strengths and weaknesses of the
individual relaxation schemes as well as those of the overall multigrid procedure when
used as a solver on highly stretched grids.


1. INTRODUCTION 

In recent years there has been considerable progress in the development of
multigrid solvers for the steady incompressible Navier-Stokes equations. The multigrid
process and its application to fluid dynamics have been well described by Brandt 1 .
Ghia et al. 2 used the streamfunction vorticity formulation with the coupled strongly
implicit scheme of Rubin and Khosla 3 as a smoothing operator and an accommodative
multigrid cycle. Defect correction was used to increase the accuracy of the convection
terms. Vanka 4 employed a locally coupled Gauss-Seidel smoother for the primitive
variable formulation together with an accommodative cycle. Demuren 5 extended Vanka's
smoother to one in which local corrections were coupled to neighboring pressure
corrections and solved the resulting equations by both a strongly implicit technique and
an alternating direction line Gauss-Seidel scheme. Thompson and Ferziger 6 used Vanka's
smoother as well as a fully coupled alternating direction line Gauss-Seidel extension and
an accommodative cycle. This study also introduced defect correction together with local
adaptive grid refinement. Sivaloganathan and Shaw 7 used the SIMPLE pressure-correction
scheme of Patankar and Spalding 8 as a smoother for the primitive variable formulation.
The smoothing analysis given in Shaw and Sivaloganathan 9 indicates that a fixed V-cycle
was used in the multigrid process. Dick 10 developed a partially flux-split discretization
for the primitive variable formulation and used a coupled red-black smoother and a fixed
W-cycle. Finally, a few solvers have used boundary-fitted curvilinear coordinates with
primitive variables. Joshi and Vanka 11 extended Vanka's coupled Gauss-Seidel relaxation
technique to this system. Rayner 12 and Shyy et al. 13 developed variants to the SIMPLE
pressure correction method for use as smoothers with the latter applicable to all speeds.
The last three references all employed a fixed V-cycle.

In most of the above efforts a single relaxation scheme has been used as a smooth- 
ing operator in a chosen multigrid cycle and applied to one or more problems in order to 
demonstrate the characteristics of the flow solver. This does not provide much guidance in
the choice of smoother or multigrid cycle for the developer of a solver for a particular appli- 
cation. Furthermore, among the above works only Brandt 1 and Thompson and Ferziger 6 
have addressed the need for highly refined grids in local regions, which is present in most 
flow problems. The adaptive use of several levels of uniform local subgrids 6 is attractive 
in the multigrid context since it adds extra points only where they are needed. A more 
conventional approach employs stretched grids which may make it easier to resolve thin 
regions of steep gradients such as boundary layers adjacent to solid surfaces. This raises 
the question, however, as to whether fast multigrid performance can be maintained on 
these grids. 

The present work considers the primitive variable formulation of the steady incom- 
pressible Navier-Stokes equations in Cartesian coordinates. Four relaxation methods with 
a common discretization have been used as smoothers and embedded in a single tailored 
multigrid procedure. The equations are discretized on a staggered grid with first order 
upwind used for convection in the relaxation process on all grids and defect correction to 
second order central on the fine grid introduced once per multigrid cycle. The resulting 
solvers have been applied to three two-dimensional problems over a range of Reynolds 
numbers on both uniform and highly stretched grids. The results from this study are 
used to draw conclusions on the strengths and weaknesses of the individual relaxation 
schemes as well as those of the overall multigrid procedure when used as a solver on highly 
stretched grids. The results from an earlier study using first order hybrid differencing will 
be presented elsewhere 14 in somewhat greater detail. 

2. DISCRETE FORMULATION 

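The steady incompressible Navier-Stokes equations in non-dimensional form are
written as

$$
\frac{\partial uu}{\partial x} + \frac{\partial uv}{\partial y}
  = -\frac{\partial p}{\partial x}
  + \frac{1}{Re}\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right), \tag{1}
$$

$$
\frac{\partial uv}{\partial x} + \frac{\partial vv}{\partial y}
  = -\frac{\partial p}{\partial y}
  + \frac{1}{Re}\left(\frac{\partial^2 v}{\partial x^2} + \frac{\partial^2 v}{\partial y^2}\right), \tag{2}
$$

$$
\frac{\partial u}{\partial x} + \frac{\partial v}{\partial y} = 0, \tag{3}
$$

where $u$ and $v$ are the $x$ and $y$ velocity components, $p$ is the pressure, and $Re$ is the
Reynolds number.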
These equations are discretized on a staggered grid using a finite volume approach:

$$ -R^u_{i,j} = L^u u_{i,j} + dy_j \, \Delta_x p_{i,j} = 0, \tag{4} $$

$$ -R^v_{i,j} = L^v v_{i,j} + dx_i \, \Delta_y p_{i,j} = 0, \tag{5} $$

$$ -R^c_{i,j} = dy_j \, \nabla_x u_{i,j} + dx_i \, \nabla_y v_{i,j} = 0, \tag{6} $$

where $\Delta_x$, $\nabla_x$, $\Delta_y$, $\nabla_y$ are forward and backward differences in $x$ and $y$, respectively,
$dx_i = x_i - x_{i-1}$, $dy_j = y_j - y_{j-1}$, and

$$ L^u u_{i,j} = a^u_c u_{i,j} - a^u_w u_{i-1,j} - a^u_e u_{i+1,j} - a^u_s u_{i,j-1} - a^u_n u_{i,j+1}, \tag{7} $$

$$ L^v v_{i,j} = a^v_c v_{i,j} - a^v_w v_{i-1,j} - a^v_e v_{i+1,j} - a^v_s v_{i,j-1} - a^v_n v_{i,j+1}. \tag{8} $$


When these expressions require points outside the domain, such as $L^u u_{i,j}$ adjacent to a
horizontal boundary, these points are transferred to the boundary by quadratic
extrapolation. A linear extrapolation is employed at an outflow boundary where $p_{i,j}$ is
specified.

The coefficients in Eqs. (7) and (8) are obtained by utilizing first order upwinding
for convection and second order central differencing for diffusion. The difference between
first order upwind and second order central convection discretizations on the finest grid
is added as a defect correction source term in a manner similar to that of Thompson and
Ferziger 6 . Prior to each sweep through the grid a single set of coefficients, $a^p$, is obtained
for equations centered on the $p_{i,j}$ locations and held constant during the sweep. The
coefficients $a^u$ and $a^v$ are obtained by averaging. Thus

$$
(a^u_c)_{i,j} = [(a^p_c)_{i,j} + (a^p_c)_{i+1,j}]/2, \qquad
(a^v_c)_{i,j} = [(a^p_c)_{i,j} + (a^p_c)_{i,j+1}]/2.
$$


For the convective terms this is equivalent to obtaining the cell face velocities by averaging. 
For the viscous terms this introduces an error on a stretched grid that is of the same order 
as the truncation error. In the immediate vicinity of a reentrant corner this practice must 
be modified to ensure that the convective velocity normal to the wall is set to zero. 
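
The defect correction idea used for the convection terms can be illustrated on a much simpler model problem. The Python sketch below assumes a 1D steady convection-diffusion equation $c\,u_x - \nu\,u_{xx} = f$ on (0, 1) with zero boundary values and $c > 0$; it is not the staggered-grid Navier-Stokes discretization of this paper, and all function names are illustrative. Each iteration solves with the first order upwind operator, while the defect is evaluated with the second order central operator, so the iterates approach the central-difference solution.

```python
def thomas(sub, main, sup, rhs):
    """Solve a tridiagonal system (sub-, main and super-diagonal as lists)."""
    n = len(rhs)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = sup[0] / main[0], rhs[0] / main[0]
    for i in range(1, n):
        m = main[i] - sub[i] * cp[i - 1]
        cp[i] = sup[i] / m
        dp[i] = (rhs[i] - sub[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def stencil(n, c, nu, upwind):
    """Tridiagonal coefficients for c*u_x - nu*u_xx on n interior points."""
    h = 1.0 / (n + 1)
    if upwind:  # first order upwind convection (c > 0 assumed)
        w, p, e = -c / h - nu / h**2, c / h + 2 * nu / h**2, -nu / h**2
    else:       # second order central convection
        w, p, e = -c / (2 * h) - nu / h**2, 2 * nu / h**2, c / (2 * h) - nu / h**2
    return [w] * n, [p] * n, [e] * n


def apply_stencil(st, u):
    """Matrix-vector product with zero values outside the interior."""
    w, p, e = st
    n = len(u)
    return [p[i] * u[i]
            + (w[i] * u[i - 1] if i > 0 else 0.0)
            + (e[i] * u[i + 1] if i < n - 1 else 0.0) for i in range(n)]


def defect_correction(n, c, nu, f, iters):
    """Iterate u <- u + L_upwind^{-1} (f - L_central u), starting from zero."""
    low, high = stencil(n, c, nu, True), stencil(n, c, nu, False)
    u = [0.0] * n
    for _ in range(iters):
        defect = [fi - ri for fi, ri in zip(f, apply_stencil(high, u))]
        du = thomas(low[0], low[1], low[2], defect)
        u = [ui + di for ui, di in zip(u, du)]
    return u
```

In the multigrid setting of the paper the upwind solve is replaced by a multigrid cycle and the defect term is refreshed once per cycle; the sketch uses an exact tridiagonal solve only to keep the example short.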


3. RELAXATION METHODS 

Each of the relaxation methods employed as a multigrid smoother in this work is
adapted from or similar to a known technique from the literature, and hence the
descriptions of the schemes will be brief. The methods are written in a common
block-tridiagonal form for the corrections along a horizontal line:

$$ -A_i \Delta V_{i-1} + B_i \Delta V_i - C_i \Delta V_{i+1} = D_i, \tag{9} $$

where $\Delta V_i$ is the vector of local corrections, $A_i$, $B_i$, $C_i$ are square matrices, and $D_i$ is
the vector of local residuals. By appropriate choices of the square matrices, Eq.(9) can be
used to describe both point or explicit schemes and semi or fully implicit schemes. This
equation is now particularized for each of the methods.
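
A system of the form of Eq.(9) can be solved along a line by block elimination. The following pure-Python sketch shows one way to do this, assuming small dense blocks stored as nested lists; the helper names `solve`, `matmul` and `block_thomas` are hypothetical and the code is not taken from the solvers described in this paper.

```python
def solve(M, R):
    """Solve M X = R (M is n x n, R is n x m) by Gauss-Jordan with pivoting."""
    n = len(M)
    aug = [list(M[i]) + list(R[i]) for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                fac = aug[r][col]
                aug[r] = [vr - fac * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def matmul(X, Y):
    """Dense matrix product of nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def block_thomas(A, B, C, D):
    """Solve -A[i] V[i-1] + B[i] V[i] - C[i] V[i+1] = D[i] for i = 0..n-1.

    A[0] and C[n-1] are ignored. Forward elimination stores G[i], Y[i] with
    V[i] = Y[i] - G[i] V[i+1]; back substitution then recovers V.
    """
    n, m = len(B), len(B[0])
    G, Y = [], []
    for i in range(n):
        M = [row[:] for row in B[i]]
        rhs = list(D[i])
        if i > 0:
            AG = matmul(A[i], G[i - 1])
            M = [[M[r][c] + AG[r][c] for c in range(m)] for r in range(m)]
            rhs = [rhs[r] + sum(A[i][r][k] * Y[i - 1][k] for k in range(m))
                   for r in range(m)]
        upper = [[-C[i][r][c] for c in range(m)] for r in range(m)]
        X = solve(M, [upper[r] + [rhs[r]] for r in range(m)])
        G.append([row[:m] for row in X])
        Y.append([row[m] for row in X])
    V = [None] * n
    V[n - 1] = Y[n - 1]
    for i in range(n - 2, -1, -1):
        V[i] = [Y[i][r] - sum(G[i][r][k] * V[i + 1][k] for k in range(m))
                for r in range(m)]
    return V
```

When the off-diagonal blocks are zero the loop reduces to independent small solves, which corresponds to the point or explicit case mentioned in the text.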
The first method, here labeled Block Gauss-Seidel (BGS), is a locally coupled explicit
scheme introduced by Vanka 4 . Four discrete momentum equations and one continuity
equation are solved for a set of local corrections. In this case

$$ \Delta V_i = (\Delta u_{i-1,j}, \Delta u_{i,j}, \Delta v_{i,j-1}, \Delta v_{i,j}, \Delta p_{i,j})^T, \tag{10} $$

$B_i$, given by Eq.(11), is a 5 x 5 matrix, and $A_i = C_i = 0$. Elimination of the $\Delta u$'s
and $\Delta v$'s gives a simple expression for $\Delta p_{i,j}$, and back substitution then gives the
local $\Delta u$'s and $\Delta v$'s. In a single sweep through the grid, each momentum equation is
updated twice and each continuity equation once.
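
The elimination just described can be sketched for a single cell as follows. The sketch assumes the commonly used form of the local system, in which each of the four momentum equations couples one velocity correction to the cell pressure correction and the continuity equation couples the four velocity corrections; the argument names (`a`, `g`, `d`, `r_mom`, `r_cont`) are hypothetical and the entries of $B_i$ in Eq.(11) are not reproduced here.

```python
def vanka_local_solve(a, g, d, r_mom, r_cont):
    """Solve a 5 x 5 'arrow' system for four velocity corrections and dp.

    Assumed local equations (illustrative form):
        a[k] * dvel[k] + g[k] * dp = r_mom[k],   k = 0..3  (momentum)
        sum_k d[k] * dvel[k]       = r_cont                (continuity)
    """
    # Eliminate the velocity corrections to obtain a scalar equation for dp.
    s_rhs = sum(dk * rk / ak for dk, rk, ak in zip(d, r_mom, a)) - r_cont
    s_mat = sum(dk * gk / ak for dk, gk, ak in zip(d, g, a))
    dp = s_rhs / s_mat
    # Back substitution for the velocity corrections.
    dvel = [(rk - gk * dp) / ak for rk, gk, ak in zip(r_mom, g, a)]
    return dvel, dp
```

Because the velocity block is diagonal, the cost per cell is a handful of multiplications, which is consistent with the low cost per sweep reported for BGS.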

The second method, labeled Pressure-linked Line Block Gauss-Seidel (PLBGS), is
a locally coupled semi-implicit scheme which is similar to the line relaxation scheme of
Demuren 5 . This case is a simple extension of BGS:

$$ \Delta V_i = (\Delta u_{i,j}, \Delta v_{i,j-1}, \Delta v_{i,j}, \Delta p_{i,j})^T, \tag{12} $$

$B_i$ is a 4 x 4 matrix obtained by eliminating the top row and left column from Eq.(11),
and $A_i = C_i = 0$ except for the lower left and upper right corner elements, respectively.
Elimination of the $\Delta u$'s and $\Delta v$'s gives a scalar tridiagonal equation for the $\Delta p$'s along the
horizontal line and back substitution then gives the $\Delta u$'s and $\Delta v$'s along the line. During
a single sweep in the +y direction, each u-momentum equation is updated once, each
v-momentum equation twice, and each continuity equation once. The fewer momentum
updates and the efficiency of the scalar tridiagonal inversion give a scheme that costs 15%
less per sweep than BGS. In general both x and y sweeps are combined in an alternating
pattern to form an effective relaxation technique.

The third method, labeled Line Block Gauss-Seidel (LBGS), is a locally coupled,
fully implicit scheme, which is apparently very similar to the coupled alternating line
approach of Thompson and Ferziger 6 . The vectors $\Delta V_i$ and $D_i$ and the matrix $B_i$ are the
same as for PLBGS, while $A_i$ and $C_i$ are 4 x 4 matrices having diagonal plus the lower
left and upper right corner elements, respectively. The number of equation updates and
sweeping patterns are the same as for PLBGS. In this case algebraic elimination in the
block-tridiagonal inversion gives a scheme that costs only 15% more per sweep than BGS.

The final method is the Semi-Implicit Pressure-Correction scheme (SIMPLE)
introduced by Patankar and Spalding 8 . In this case

$$
\Delta V_i = (\Delta u_{i,j}, \Delta v_{i,j})^T, \qquad
D_i = (R^u_{i,j}, R^v_{i,j})^T, \tag{13}
$$

where $A_i$, $B_i$, $C_i$ are diagonal 2 x 2 matrices. The pressure is obtained from an elliptic
equation derived by substituting reduced forms of the discrete momentum equations for
coupled velocity and pressure corrections into continuity. For this work one SIMPLE
iteration consists of a single scalar line Gauss-Seidel sweep for each momentum equation
with the pressure fixed. This is followed by four alternating direction line Gauss-Seidel
sweeps of the elliptic pressure-correction equation. Taking more than one sweep through
the momentum equations before correcting the pressure invariably resulted in partial
decoupling of the velocity components and slower convergence. Each of these combined
SIMPLE iterations costs about 30% more than one sweep of BGS.

For each of these relaxation techniques some degree of underrelaxation is required to
obtain convergence. In the present work this is implemented through direct modification
of the momentum equations. For BGS, LBGS, and SIMPLE, the diagonal velocity
coefficients, $a^u_c$ and $a^v_c$, in the matrix $B_i$ are divided by a factor $r_{mom}$ where
$0 < r_{mom} < 1$. For PLBGS the residuals, $R^u$ and $R^v$, are multiplied by $r_{mom}$. In addition
for SIMPLE the pressure corrections and the corresponding velocity corrections required
to satisfy continuity are unrelaxed.

Finally, we note that considerable improvement can be obtained with each of the above 
methods by employing a symmetric sweeping pattern. Thus for BGS each lexicographic 
sweep is followed by one in the reverse direction. For PLBGS, LBGS, and SIMPLE a 
four sweep symmetric alternating line pattern is used, i.e. relaxation is performed sequen- 
tially in the +x , +y, -y, -x directions. These techniques result in an approximately 25% 
improvement in convergence rates. 

4. MULTIGRID IMPLEMENTATION 


Local relaxation methods, such as those of the previous section, are in general much 
more efficient at reducing short wavelength error components on a given grid than those of 
longer wavelength. Multigrid seeks to overcome this problem by transferring the longwave 
components of the solution to a sequence of coarser grids where relaxation is more effec- 
tive and much cheaper. Since the FAS-FMG technique used in this work has been well 
documented in the literature 1,2,4-7 , it will not be described here. The focus will instead
be on the current implementation and in particular on those aspects which are important 
for achieving a fast, robust Navier-Stokes solver. 

In the present work the coarse grids are created by "standard coarsening," i.e., every
second grid point in both x and y is deleted from one grid to the next coarser grid.
The fine-to-coarse restriction operator $I_f^c$ for unknowns employs cell-face averaging for
the velocities,

$$
u^c_{i,j} = (u_{i,j-1}\,dy_{j-1} + u_{i,j}\,dy_j)/dy^c_j, \qquad
v^c_{i,j} = (v_{i-1,j}\,dx_{i-1} + v_{i,j}\,dx_i)/dx^c_i, \tag{14}
$$

and full-weighting for the pressures,

$$
p^c_{i,j} = (p_{i-1,j-1}\,dx_{i-1}dy_{j-1} + p_{i-1,j}\,dx_{i-1}dy_j
           + p_{i,j-1}\,dx_i dy_{j-1} + p_{i,j}\,dx_i dy_j)/(dx^c_i\,dy^c_j), \tag{15}
$$

where $(\ )^c$ represents a coarse-grid value. The restriction operator $I_f^c$ for residuals uses
full-weighting, in which all the fine-grid contributions to a coarse-grid cell are accounted
for:

$$
(I_f^c R^u)_{i,j} = R^u_{i,j-1} + R^u_{i,j}
   + \tfrac{1}{2}(R^u_{i-1,j-1} + R^u_{i-1,j} + R^u_{i+1,j-1} + R^u_{i+1,j}),
$$

$$
(I_f^c R^v)_{i,j} = R^v_{i-1,j} + R^v_{i,j}
   + \tfrac{1}{2}(R^v_{i-1,j-1} + R^v_{i,j-1} + R^v_{i-1,j+1} + R^v_{i,j+1}), \tag{16}
$$

where $R^u$ and $R^v$, given by Eqs.(4) and (5), are already area-weighted. With the cell-face
averaging given by Eqs.(14) and full-weighting similar to Eq.(15) for $R^c$, as defined by
Eq.(6), the coarse-grid source term vanishes for the continuity equation.

In many of the previous works 4,5,7,13 cell-face averaging was also used in the restriction
of R u and R v . For uniform grids this has little effect on the multigrid convergence rate. 
For the highly stretched grids employed in this work this proved to be ineffective. In some 
cases convergence slowed by a factor of three or four. In others little or no benefit was 
gained from the multigrid process. 
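
The area-weighted full weighting of Eq.(15) can be sketched in Python for cell-centered values on a stretched grid. The sketch assumes standard coarsening with an even number of fine cells in each direction and zero-based indexing, so that coarse cell (I, J) is the union of fine cells 2I, 2I+1 in x and 2J, 2J+1 in y; the function name and the nested-list storage are illustrative only.

```python
def restrict_pressure(p, dx, dy):
    """Area-weighted restriction of cell-centered values to the coarse grid.

    p[j][i] is the fine-grid value in cell (i, j); dx[i] and dy[j] are the
    fine cell sizes. Each coarse value is the area-weighted mean of the four
    fine cells it contains.
    """
    nxc, nyc = len(dx) // 2, len(dy) // 2
    pc = []
    for J in range(nyc):
        row = []
        for I in range(nxc):
            dxc = dx[2 * I] + dx[2 * I + 1]
            dyc = dy[2 * J] + dy[2 * J + 1]
            total = sum(p[j][i] * dx[i] * dy[j]
                        for j in (2 * J, 2 * J + 1)
                        for i in (2 * I, 2 * I + 1))
            row.append(total / (dxc * dyc))
        pc.append(row)
    return pc
```

On a uniform grid the weights reduce to 1/4 each; on a stretched grid the larger fine cells dominate, which is the property the text identifies as important for the residual restriction as well.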

The coarse-to-fine prolongation operator $I_c^f$ for corrections employs bilinear
interpolation in computational space where the grid spacing is taken to be uniform. For fine grid
points adjacent to boundaries, a zero normal gradient is assumed for pressures. The overall 
convergence has proven to be insensitive to the details of this approximation. The same 
operator with one modification is also used to interpolate “converged” results to obtain 
initial values on a fine grid in the FMG process. The velocity component parallel to an 
adjacent wall is obtained by bilinear extrapolation from the interior since the boundary 
layer is poorly resolved on the coarse grid. 

The multigrid solvers in this work have been coded to permit fixed V-cycles and
W-cycles. During the course of this effort it was found that for the difficult cases with high
Reynolds numbers or highly stretched grids a W(1,1) cycle was the most effective strategy
in terms of robustness and computational cost. Hence, all results presented in this paper
were performed using this cycle. The defect correction source term, discussed earlier, is
updated once per cycle on the finest grid. Accommodative cycles 1,2,4-6 , which decide on
whether or not to restrict to a coarser grid based on the ratio of errors from two successive
sweeps, proved to be too costly since the second sweep on each visit to a grid contributed
little to the overall convergence of the method.
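
The difference between fixed V- and W-cycles can be made concrete by listing the grid levels on which a smoothing step is applied during one cycle. The Python sketch below assumes one sweep before and one after each coarse-grid visit, a single sweep on the coarsest grid, and a factor of four fewer points on each coarser 2D grid; these are illustrative assumptions, and the function names are hypothetical.

```python
def cycle_visits(level, gamma):
    """Levels at which a smoothing step is applied in one fixed cycle.

    Level 1 is the coarsest grid. gamma = 1 gives a V-cycle and gamma = 2 a
    W-cycle, each with one pre- and one post-smoothing sweep per level.
    """
    if level == 1:
        return [1]
    visits = [level]                       # pre-smoothing
    for _ in range(gamma):
        visits += cycle_visits(level - 1, gamma)
    visits.append(level)                   # post-smoothing
    return visits


def work_units(visits, finest):
    """Cost in fine-grid sweeps, assuming 1/4 of the points per coarser grid."""
    return sum(0.25 ** (finest - k) for k in visits)
```

For example, `work_units(cycle_visits(3, 2), 3)` compares directly with `work_units(cycle_visits(3, 1), 3)` to show how much extra coarse-grid work the W-cycle spends per fine-grid sweep.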

The symmetric sweeping pattern described in the previous section has been interleaved 
with the multigrid process. A sweep counter is established for every grid level, and on each 
visit to that level the next direction in the sweep pattern for that grid is performed. This 
proved to be sufficient to give all the convergence benefits of the sweeping symmetry. 
Finally, it should be noted that varying the momentum relaxation factor $r_{mom}$ from grid
to grid during the cycle provided considerable performance enhancement for the BGS, 
PLBGS, and LBGS solvers. No benefit, however, was observed when this was tried with 
the SIMPLE-based solver. 
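
The per-level sweep counter described above can be sketched as a small Python class. The class and method names are hypothetical; the direction order follows the four sweep symmetric alternating line pattern given in the previous section, and the BGS case (forward and reverse lexicographic sweeps) would use a two-entry pattern instead.

```python
class SweepScheduler:
    """Keeps one sweep counter per grid level and cycles through a pattern."""

    PATTERN = ("+x", "+y", "-y", "-x")

    def __init__(self, n_levels):
        # Index 0 is unused so that levels can be numbered from 1.
        self.count = [0] * (n_levels + 1)

    def next_direction(self, level):
        """Return the next sweep direction for this level and advance it."""
        direction = self.PATTERN[self.count[level] % len(self.PATTERN)]
        self.count[level] += 1
        return direction
```

Each level advances independently, so a coarse grid visited twice per W-cycle moves through the pattern twice as fast as the finest grid.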


5. CONVERGENCE CRITERIA 

The various convergence criteria used in this work are all based on an $L_2$ norm of
the dynamic velocity changes occurring during a sweep through the grid. This would seem
to be a more appropriate form for a system of coupled equations than one based on a
combination of the residuals of the different equations. The pressures have been excluded
since they are only determined to within an arbitrary constant. Introduce the definition
of $e^k$, Eq.(17), where $n_x^k$ and $n_y^k$ are the number of cells on grid $k$ in x and y
respectively, and $\Delta u^k_{i,j}$, $\Delta v^k_{i,j}$ are the dynamic velocity changes obtained during a
sweep on grid $k$. Then for a sequence of coarse-to-fine grids, k = 1 to m, the overall
convergence criterion on grid m is taken as

$$ e^m < 10^{-6}. \tag{18} $$

In most cases at convergence given by Eq.(18) the value of $\max(\Delta u, \Delta v)$ is approximately
$10^{-5}$. For intermediate grids in the FAS-FMG process, convergence before interpolating
to the next finer grid is taken as

$$ e^k < 10^{-3}, \tag{19} $$

and for the coarsest grid, k = 1, the "solution" is given by

$$ e^1 < e^k/10, \tag{20} $$

where now $e^k$ is the most recent error on the current finest grid.


Finally, it is noted that all computations in this work were performed on an Amdahl 
5980 in scalar mode. All CPU times reported in the next sections are for this machine. 


6. COMPUTATIONAL RESULTS 

Three problems have been chosen to test the performance of the multigrid solvers 
under different conditions: flow in a driven cavity, developing flow in a straight channel, 
and flow over an open cavity. 

Driven Cavity Flow 

The driven cavity is the prototypical recirculating flow and has long been used as a 
standard test problem for Navier-Stokes solvers. The second-order streamfunction-vorticity 
results of Ghia et al. 2 are generally accepted as the standard. Flow is set up in a square 
cavity with three stationary walls and a top lid that moves to the right with constant speed 
($u = 1$). Profiles of $u$ on the vertical centerline computed on a uniform 256x256 grid for
Re = 1000 and 5000 are compared with the standard results 2 in Figure 1. The present 
defect correction results agree with the standard to within plotting accuracy. 

The first set of results for this flow is for a uniform grid with Re varying from 100 
to 5000. Table I compares the uniform grid results for each solver on a 256x256 grid in 
terms of cpu times, number of fine grid sweeps, and total work units for each case where a 
work unit is the cpu time required for one fine grid sweep of the particular smoother. Here
$r_{mom}^{fg}$ is the fine grid relaxation factor for the solver. As the Reynolds number increases
the table indicates a significant advantage for BGS and PLBGS over LBGS and SIMPLE
due to faster convergence and less cost per sweep.

The second set of results is obtained for Re = 1000 on a grid with hyperbolic tangent 
stretching in x and y and the maximum mesh aspect ratio (AR) varying from 1 to 40. Table 
II compares the stretched grid results for each solver on a 256x256 grid. As AR is increased 
to large values BGS is seen to have a significant advantage over the other three methods in 
both number of sweeps and cpu time. The use of highly stretched grids produces a strong 
asymmetry in the momentum equation coupling coefficients in regions of high mesh aspect 
ratio and this was expected to adversely affect the smoothing properties of an explicit 
scheme 1 such as BGS. The alternating direction semi-implicit and fully implicit schemes 
were introduced to see if they would give more robust performance for these cases. This 
proved not to be true for the Navier-Stokes solvers used in this study. 

Developing Channel Flow 

The second test problem is the deceptively simple one of developing flow in a straight 
channel one unit high by four units long. Uniform velocities (u = 1, v = 0) are specified 
at the entrance and a constant pressure (p = 0) is set at the exit. Note, for incompressible
flow, the common exit condition, $\partial u/\partial x = 0$, implies $\partial v/\partial y = \partial p/\partial y = 0$. Profiles of u vs
y along the channel for Re = 1000 are shown in Figure 2. For this and higher Reynolds 
numbers the flow is far from fully developed at the exit. This flow has velocities strongly 
aligned with the x direction over much of the domain and the u momentum equation 
becomes increasingly decoupled in y away from the walls as Re is increased. This situation 
is known to cause problems for multigrid solvers (see e.g. Brandt 1 and Mulder 15 ) and thus 
was chosen as a fitting test case for this study. 

The first set of results is for a uniform grid with Re again varying from 100 to 5000. 
The uniform grid results for each solver on a 256x64 grid are compared in Table III. It is 
evident that the multigrid performance of all solvers degrades more rapidly with increasing 
Re than was the case for the driven cavity. The relatively poor performance of SIMPLE 
is probably due to the partial decoupling between u and v at high Re which was observed
during the iterative process. Note, however, that all methods still converged in under 100 
fine grid sweeps even at the highest Reynolds numbers. 

The second set of results for this flow is for hyperbolic tangent stretching in y only, 
again with AR varying from 1 to 40 and Re = 1000. Stretched grid results for each solver 
on a 256x64 grid are compared in Table IV. As AR is increased to large values, it is evident
that LBGS has a major advantage over the other smoothers in both fine grid sweeps and 
cpu time. This case of strong alignment on a stretched grid is the only one in which an 
implicit scheme (LBGS) has a substantial advantage over the explicit BGS. 

Open Cavity Flow 

The final test problem combines the driven cavity and developing channel flows and 
adds the complication of a strong corner singularity. The domain consists of a channel one 
unit high by two units long on top of an open square cavity one unit on a side located at 
the left boundary of the channel. Uniform flow (u = 1, v = 0) enters the channel at the
left and the flow exits at the right (p = 0). Streamfunction and vorticity contours for Re
= 1000 are shown in Figure 3. Note the lack of separation and the strong concentration 
of vorticity contours at the downstream corner. 

As before the first set of results is for a uniform grid with Re varying from 100 to 
5000. The uniform grid results for each solver on a 128x128 + 256x128 grid are compared 
in Table V. The results for both fine grid sweeps and cpu time show that BGS, PLBGS, 
and LBGS remain competitive as Reynolds number is increased but SIMPLE suffers a 
substantial penalty. 

The second set of results for this flow is for hyperbolic tangent stretching in both x 
and y, in each of three square regions, with AR varying from 1 to 40 and Re = 1000. 
The stretched grid results for each solver on a 128x128 + 256x128 grid are compared in 
Table VI. Here it is evident that BGS has a significant advantage in fine grid sweeps and 
cpu time as AR increases. It should also be noted that PLBGS and LBGS appeared to 
be more sensitive to the presence of the corner singularity and to the choice of $r_{mom}$ for
the set of grids used in the multigrid process. However, no detailed study of this effect was
performed.


7. CONCLUSIONS 


From the above results, it is evident that a proper combination of tailored multigrid 
elements can yield a fast robust solver for the steady incompressible Navier-Stokes equa- 
tions even on highly stretched grids. In particular, for fine-to-coarse restriction of residuals, 
the use of full weighting is important on stretched grids. For coarse-to-fine prolongation of 
corrections, on the other hand, bilinear interpolation works well and is insensitive to the 
details of the boundary treatment. And finally a fixed W(1,1) multigrid cycle appears to
offer a good mix of robustness and computational efficiency. 

For recirculating flows such as the driven cavity, all four smoothers are effective and 
competitive. On uniform grids BGS and PLBGS offer a significant advantage over LBGS 
and SIMPLE, primarily due to less cost per sweep. On stretched grids BGS and SIMPLE 
show superior multigrid performance, but BGS is substantially cheaper per sweep. 

For strongly aligned flows such as that in a developing channel, all four solvers degrade 
more rapidly with increasing Reynolds number than for recirculating flows with SIMPLE 
falling off much more rapidly than the others, but they all still converge in under 100 fine 
grid sweeps. On highly stretched grids, however, LBGS offers a major advantage in both 
multigrid performance and net cpu time over the other three smoothers. This is the only 
case in which an implicit scheme is distinctly superior to the explicit BGS. 

For mixed recirculating/aligned flows such as the open cavity, all four smoothers are 
effective. On uniform grids, SIMPLE again degrades much more rapidly than the others 
with increasing Reynolds number. On stretched grids BGS offers a small advantage in 
multigrid performance, but this becomes significant when net cpu time is considered. It is 
also notable that BGS is less sensitive than the other smoothers to the corner singularity 
in this flow. 

On balance, BGS offers the best mix of robustness and computational speed for all 
three classes of flows. The semi-implicit schemes PLBGS and SIMPLE offer little or no 
advantage and in general are less robust. The fully implicit LBGS is superior only for the 
case of highly aligned flows on stretched grids. The pressure correction scheme SIMPLE 
is in general more costly than the other three and degrades much more rapidly than the 
others with increasing Reynolds number. Finally, we note that for a general multigrid 
solver set up using domain decomposition, it might be highly effective to use BGS over 
most domains but retain the option to use LBGS in strongly aligned domains. 

Table I. Driven Cavity Convergence 
Uniform 256x256 grid, AR = 1 
(cpu seconds / fg. sweeps / work units) 

Scheme (r_mom)  Re:               100     400    1000    3200    5000

BGS (0.6)       cpu seconds     140.2   165.4   197.6   350.7   483.4
                fg. sweeps          9      10      12      22      30
                work units       22.1    26.0    31.0    55.6    76.0

PLBGS (0.7)     cpu seconds     147.3   154.6   207.1   348.1   517.4
                fg. sweeps         10      11      14      24      36
                work units       26.6    27.6    37.2    62.6    93.3

LBGS (0.7)      cpu seconds     180.6   183.4   219.0   435.4   642.8
                fg. sweeps         10      10      12      24      36
                work units       25.1    25.4    30.5    60.5    89.1

SIMPLE (0.7)    cpu seconds     257.3   256.7   307.2   554.4   806.5
                fg. sweeps         12      12      14      26      38
                work units       27.2    27.4    32.8    59.0    85.9

Table II. Driven Cavity Convergence 
Stretched 256x256 grid, Re = 1000 
(cpu seconds / fg. sweeps / work units) 

Scheme (r_mom)  AR:                 1       5      10      20      40

BGS (0.6)       cpu seconds     197.8   168.6   168.6   199.9   231.3
                fg. sweeps         12      10      10      12      14
                work units       31.0    26.4    26.4    31.2    36.1

PLBGS (0.5)     cpu seconds     260.5   205.7   211.0   268.0   324.6
                fg. sweeps         18      14      15      19      23
                work units       47.1    36.9    37.9    47.9    57.9

LBGS (0.9)      cpu seconds     220.3   220.5   290.2   395.0   604.2
                fg. sweeps         12      12      16      22      34
                work units       30.5    30.1    39.5    53.7    81.9

SIMPLE (0.7)    cpu seconds     311.8   276.2   304.9   305.4   344.8
                fg. sweeps         14      13      14      14      16
                work units       32.7    28.4    31.9    31.8    36.0

Table III. Developing Channel Convergence 
Uniform 256x64 grid, AR = 1 
(cpu seconds / fg. sweeps / work units) 

Scheme (r_mom)  Re:               100     400    1000    3200    5000

BGS (0.7)       cpu seconds      53.0    64.5    82.4   152.3   203.4
                fg. sweeps         12      14      18      34      46
                work units       30.1    36.4    46.9    87.0   116.2

PLBGS (0.8)     cpu seconds      49.6    80.0    95.2   159.9   218.0
                fg. sweeps         12      20      24      40      56
                work units       32.5    52.2    62.3   105.2   142.5

LBGS (0.8)      cpu seconds      59.0    78.5    78.3   136.1   173.5
                fg. sweeps         12      16      16      28      36
                work units       30.7    40.5    40.7    70.7    89.4

SIMPLE (0.7)    cpu seconds      83.6   124.0   168.9   324.4   418.7
                fg. sweeps         16      24      32      64      84
                work units       39.9    58.8    79.9   154.3   199.6

Table IV. Developing Channel Convergence 
Stretched 256x64 grid, Re = 1000 
(cpu seconds / fg. sweeps / work units) 

Scheme (r_mom)  AR:                 1       5      10      20      40

BGS (0.7)       cpu seconds      81.9   139.9   202.1   251.7   268.2
                fg. sweeps         18      32      46      58      62
                work units       46.9    79.4   113.8   141.1   150.8

PLBGS (0.7)     cpu seconds     110.2   117.3   193.8   196.3   230.4
                fg. sweeps         28      30      50      50      59
                work units       72.4    75.5   124.5   127.2   147.5

LBGS (0.85)     cpu seconds      78.4    87.8   105.4   123.9   142.1
                fg. sweeps         16      18      22      26      30
                work units       40.6    44.7    54.0    63.4    73.3

SIMPLE (0.7)    cpu seconds     168.2   142.0   162.0   201.1   261.5
                fg. sweeps         32      28      32      40      52
                work units       79.9    67.6    76.9    95.1   123.6

Table V. Open Cavity Convergence 
Uniform 128x128 + 256x128 grid, AR = 1 
(cpu seconds / fg. sweeps / work units) 

Scheme (r_mom)  Re:               100     400    1000    3200    5000

BGS (0.7)       cpu seconds     173.5   203.1   230.7   381.5   573.3
                fg. sweeps         12      14      16      26      40
                work units       29.7    34.6    39.7    65.2    98.7

PLBGS (0.7)     cpu seconds     186.2   237.7   262.2   465.6   616.6
                fg. sweeps         15      19      20      36      48
                work units       36.4    46.4    51.7    91.2   120.9

LBGS (0.8)      cpu seconds     166.5   221.1   256.9   439.4   619.5
                fg. sweeps         11      14      16      28      40
                work units       25.7    34.4    40.2    68.7    97.5

SIMPLE (0.7)    cpu seconds     243.9   319.5   419.6   705.6  1004.2
                fg. sweeps         14      18      24      40      58
                work units       32.5    42.6    56.1    93.5   133.6


Table VI. Open Cavity Convergence 
Stretched 128x128 + 256x128 grid, Re = 1000 
(cpu seconds / fg. sweeps / work units) 

Scheme (r_mom)  AR:                 1       5      10      20      40

BGS (0.7)       cpu seconds     230.0   175.8   176.7   206.0   233.6
                fg. sweeps         16      12      12      14      16
                work units       39.6    30.1    30.2    35.2    40.0

PLBGS (0.5)     cpu seconds     367.1   263.4   238.3   245.2   298.3
                fg. sweeps         28      20      18      19      23
                work units       70.6    50.8    46.1    47.2    57.1

LBGS (0.95)     cpu seconds     224.4   219.8   222.2   283.0   343.0
                fg. sweeps         14      14      14      18      22
                work units       35.6    34.5    34.6    44.1    53.5

SIMPLE (0.7)    cpu seconds     423.0   289.3   280.8   282.1   379.6
                fg. sweeps         24      16      16      16      22
                work units       55.9    38.0    37.2    37.2    50.5

Figure 1. Driven cavity u-velocities on the vertical centerline computed on a 
uniform 256x256 grid for Re = 1000 (left) and 5000 (right) 



Figure 2. Developing channel u-velocity profiles for Re = 1000 


SUMMARY 

A multigrid preconditioned conjugate gradient method (MGCG method), which uses the 
multigrid method as a preconditioner of the PCG method, is proposed. The multigrid method has 
inherently high parallelism and improves the convergence of long-wavelength components, which is 
important in iterative methods. Using it as a preconditioner of the PCG method yields an 
efficient method with high parallelism and fast convergence. First, the conditions that the 
multigrid preconditioner must satisfy in order to be a valid preconditioner of the PCG method 
are considered. Next, numerical experiments show the behavior of the MGCG method and show that 
it is superior to both the ICCG method and the multigrid method in terms of fast convergence and 
high parallelism. This fast convergence is explained by an eigenvalue analysis of the 
preconditioned matrix. The analysis shows that the MGCG method converges in very few iterations 
and that the multigrid preconditioner is a desirable preconditioner of the conjugate gradient 
method. 

1 INTRODUCTION 

Typical numerical methods for the large systems of linear equations that arise after discretization 
of partial differential equations are the preconditioned conjugate gradient method (PCG method) 
and the multigrid method [12]. The conjugate gradient method is valued because it is well suited to 
parallel computing and because even ill-conditioned problems can be solved easily with the help of 
a good preconditioner. 

This paper considers an efficient preconditioner and proposes a multigrid preconditioned 
conjugate gradient method (MGCG method), which is the conjugate gradient method with the 
multigrid method as a preconditioner. The combination of the multigrid method and the conjugate 
gradient method has been considered before. Kettler and Meijerink [7] and Kettler [8] treated the 
multigrid method as a preconditioner of the conjugate gradient method. However, this paper 
formulates the MGCG method more generally and studies the requirements that the multigrid 
preconditioner must satisfy. On the other hand, Bank and Douglas [2] treated the conjugate gradient 
method as a relaxation method within the multigrid method. Braess [3] considered both 
combinations and reported that the conjugate gradient method with multigrid preconditioning is 
effective for elasticity problems. 

We study the requirements for a valid multigrid preconditioner and evaluate this preconditioner by 
numerical experiments and eigenvalue analysis. In particular, eigenvalue analysis is a more direct 
and more reasonable criterion than the convergence rate, since the number of iterations of the 
conjugate gradient method until convergence depends on the eigenvalue distribution of the 
preconditioned matrix. Sections 2 and 3 briefly explain the preconditioned conjugate gradient 
method and the multigrid method, which are the basis of this paper. Section 4 discusses the 
requirements for a valid two-grid preconditioner of the conjugate gradient method; Section 5 
extends them to the multigrid preconditioner. In Section 7, numerical experiments show that the 
MGCG method converges in very few iterations even for ill-conditioned problems. In Section 8, 
eigenvalue analysis shows why the MGCG method can easily solve problems for which the ordinary 
multigrid method alone does not converge rapidly. When the multigrid method is used as a 
preconditioner of the conjugate gradient method, it becomes quite an effective and desirable 
preconditioner. 

2 THE PRECONDITIONED CONJUGATE GRADIENT METHOD 

If a real $n \times n$ matrix $A$ is symmetric and positive definite, the solution of the linear system 
$Ax = f$ is equivalent to the minimization of the quadratic function 

$$Q(x) = \frac{1}{2} x^T A x - f^T x. \tag{1}$$

The conjugate gradient method is one such minimization method; it uses A-conjugate vectors, 
generated sequentially, as direction vectors. Theoretically this method has the striking 
property that it converges in at most $n$ steps. The method adapts well to parallel and vector 
computation, since one CG iteration requires only one matrix-vector product, two inner products, 
three linked triads, two scalar divides and one scalar compare operation. 
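
The equivalence between solving $Ax = f$ and minimizing $Q$ can be made explicit with a short derivation. This is a standard argument added here for illustration; $x^*$ denotes the solution of $Ax^* = f$ and is not a symbol used elsewhere in the paper.

```latex
\nabla Q(x) = \tfrac{1}{2}\,(A + A^T)\,x - f = Ax - f
\quad\text{(since } A = A^T\text{)},
\qquad
Q(x) - Q(x^*) = \tfrac{1}{2}\,(x - x^*)^T A\,(x - x^*) \;\ge\; 0 .
```

Because $A$ is positive definite, equality holds only for $x = x^*$, so the unique minimizer of $Q$ is the solution of the linear system.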

Next the preconditioned conjugate gradient method is explained. Let $U$ be a nonsingular matrix 
and define $\tilde A = U A U^T$; then solve $\tilde A \tilde x = \tilde f$ using the plain conjugate gradient 
method. Let $x_0$ be an initial approximate vector; then the initial residual is $r_0 = f - A x_0$. Let 
$M = U^T U$, $\tilde r_0 = M r_0$, and take the initial direction vector $p_0 = \tilde r_0$. The PCG algorithm is 
described by Program 1. 

The matrix $M$ is the preconditioning matrix, and this paper focuses on its computation. The new 
proposal is a PCG method that uses the multigrid method as the preconditioner. 

The matrix $M$ must, however, satisfy some conditions: it must be symmetric and positive 
definite. Therefore, if the matrix of the multigrid method is symmetric and positive definite, it is 
reasonable to use the multigrid method as a preconditioner of the CG method. Sections 4 and 5 
investigate the conditions under which the multigrid preconditioner satisfies these requirements. 

i = 0; 
while ( ! convergence ) { 
    $\alpha_i = (\tilde r_i, r_i) / (p_i, A p_i)$; 
    $x_{i+1} = x_i + \alpha_i p_i$; 
    $r_{i+1} = r_i - \alpha_i A p_i$; 
    convergence test; 
    $\tilde r_{i+1} = M r_{i+1}$;    // preconditioning 
    $\beta_i = (\tilde r_{i+1}, r_{i+1}) / (\tilde r_i, r_i)$; 
    $p_{i+1} = \tilde r_{i+1} + \beta_i p_i$; 
    i++; 
} 

Program 1. The PCG iteration 
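
As an illustration of Program 1, the following is a minimal Python sketch of the PCG iteration, not the authors' C++ implementation. The function names (`pcg`, `apply_M`), the dense list-of-lists matrix, the relative residual test and the diagonal (Jacobi) preconditioner in the usage lines are all assumptions made for this example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    return [dot(row, x) for row in A]


def pcg(A, f, apply_M, tol=1e-10, max_iter=200):
    """Preconditioned CG for a symmetric positive definite A.

    apply_M(r) must return M r, where M is symmetric positive definite
    and approximates the inverse of A.
    """
    x = [0.0] * len(f)
    r = list(f)                       # r_0 = f - A x_0 with x_0 = 0
    rt = apply_M(r)                   # preconditioned residual M r_0
    p = list(rt)
    rho = dot(rt, r)
    stop = tol * dot(f, f) ** 0.5
    for i in range(max_iter):
        if dot(r, r) ** 0.5 <= stop:  # convergence test
            return x, i
        Ap = matvec(A, p)
        alpha = rho / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, Ap)]
        rt = apply_M(r)               # preconditioning
        rho_new = dot(rt, r)
        beta = rho_new / rho
        p = [ti + beta * pi for ti, pi in zip(rt, p)]
        rho = rho_new
    return x, max_iter


# Usage with a hypothetical 3x3 SPD system and a diagonal preconditioner.
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
f = [1.0, 2.0, 3.0]
x, iters = pcg(A, f, lambda r: [ri / A[i][i] for i, ri in enumerate(r)])
```

Passing the preconditioner as a function rather than a matrix mirrors how a multigrid cycle would be plugged in: only the action of $M$ on a residual is needed.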
3 THE MULTIGRID METHOD 


In iterative methods, each frequency component of the residual is reduced most rapidly on 
the grid that corresponds to it. The multigrid method makes good use of this characteristic and 
exploits many grids to converge as rapidly as possible. 

These grids are ordered by level and numbered from the coarsest grid. This number is called the 
level number. When the multigrid method is applied to the solution of linear equations, the 
residual is reduced as it is moved from grid to grid. The basic element of the multigrid method is 
the defect correction principle. The defect correction scheme consists of three processes: 
pre-smoothing, coarse grid correction and post-smoothing. Various methods, such as ILU, ADI and 
zebra relaxation, have been proposed for the smoothing process. One purpose of this research, 
however, is to construct an efficient method with high parallelism. Thus an iterative method with 
high parallelism, such as the damped Jacobi method or a multi-color symmetric SOR method (SSOR 
method), is used as the smoothing method. 

An operation that transfers a vector on a finer grid to a vector on a coarser grid is called 
restriction, and the opposite operation is called prolongation. In this paper the matrix 
representing restriction is written $r$, and the matrix representing prolongation is written $p$. 

In the following sections, the equation on grid level $l$ is written as 

$$L_l x_l = f_l$$

and restriction is defined as the adjoint of prolongation. That is, 

$$r = b p^T,$$

where $b$ is a scalar constant. 
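
A concrete pair of transfer operators satisfying $r = b p^T$ is sketched below. The choice of one-dimensional linear interpolation with full-weighting restriction, for which $b = 1/2$, is an assumption made for illustration; the paper does not state which transfer operators it uses, and the function names are hypothetical.

```python
def prolongation(nc):
    """Linear interpolation from nc coarse points to 2*nc + 1 fine points (1D)."""
    nf = 2 * nc + 1
    p = [[0.0] * nc for _ in range(nf)]
    for j in range(nc):
        p[2 * j][j] = 0.5       # fine point to the left of coarse point j
        p[2 * j + 1][j] = 1.0   # fine point coinciding with coarse point j
        p[2 * j + 2][j] = 0.5   # fine point to the right of coarse point j
    return p


def transpose(m):
    return [list(col) for col in zip(*m)]


def restriction(nc, b=0.5):
    """Restriction defined as the scaled adjoint r = b * p^T."""
    return [[b * v for v in row] for row in transpose(prolongation(nc))]


p = prolongation(3)
r = restriction(3)
# Each row of r is the full-weighting stencil (1/4, 1/2, 1/4).
```

With $b = 1/2$ every row of $r$ sums to one, so a constant fine-grid vector restricts to the same constant on the coarse grid.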

4 THE TWO-GRID PRECONDITIONER 


This section and the next examine whether the multigrid method is suitable as a preconditioner 
of the PCG method. First it is shown that two kinds of two-grid methods, one with pre-smoothing 
and no post-smoothing and the other with both pre-smoothing and post-smoothing, satisfy the 
conditions of a preconditioner: the matrix of the two-grid method is symmetric and positive 
definite. Next it is shown that the V-cycle and W-cycle multigrid methods also satisfy them. 

Consider the linear equation $L_l x_l = f_l$. If $R$ is the matrix of a relaxation calculation and $u$ is an 
approximate vector, one two-grid iteration can be written in matrix form as in Table 1. 


$u = H^m u + R f$              // pre-smoothing 
$d = r (L_l u - f)$            // coarse grid correction 
$v = L_{l-1}^{-1} d$ 
$u = u - p v$ 
$u = H^m u + R f$              // post-smoothing 

Table 1. The two-grid iteration 


In this paper the relaxation calculation is an iterative method with high parallelism, and the 
matrix $R$ is defined as follows. Let $L_l$ be an $n \times n$ nonsingular, symmetric matrix that is split as 

$$L_l = P - Q, \tag{2}$$

where $P$ is a nonsingular matrix and the symmetric part of $P + Q$ is positive definite. For example, 
in the case of the point Jacobi method, $P$ is a diagonal matrix containing the diagonal elements of 
$L_l$. Then the $i$th approximate vector $u^i$ is updated as 

$$u^{i+1} = P^{-1} Q u^i + P^{-1} f. \tag{3}$$

If the initial approximate vector is the zero vector and $m$ iterations are done, $R$ is equal to 

$$R = \sum_{i=0}^{m-1} H^i P^{-1}, \quad \text{with } H = P^{-1} Q. \tag{4}$$

$H$ is called the iteration matrix. 
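
Equation (4) can be checked numerically: applying $m$ sweeps of Eq. (3) from a zero initial vector should give the same result as multiplying $f$ by $R$. The sketch below does this for damped Jacobi on a small one-dimensional Poisson matrix. The test matrix, the damping factor 2/3 and all function names are assumptions chosen for illustration.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def poisson_1d(n):
    """Tridiagonal (-1, 2, -1) matrix, used here only as a small SPD example."""
    return [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
             for j in range(n)] for i in range(n)]


def jacobi_splitting(L, omega):
    """Damped Jacobi: P = D / omega, so P^{-1} = omega * D^{-1} and H = I - P^{-1} L."""
    n = len(L)
    P_inv = [[omega / L[i][i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    PinvL = matmul(P_inv, L)
    H = [[(1.0 if i == j else 0.0) - PinvL[i][j] for j in range(n)] for i in range(n)]
    return P_inv, H


def smoother_matrix(P_inv, H, m):
    """R = sum_{i=0}^{m-1} H^i P^{-1}, Eq. (4)."""
    n = len(H)
    R = [[0.0] * n for _ in range(n)]
    term = P_inv                                   # H^0 P^{-1}
    for _ in range(m):
        R = [[R[i][j] + term[i][j] for j in range(n)] for i in range(n)]
        term = matmul(H, term)                     # next power of H times P^{-1}
    return R


def sweeps(P_inv, H, f, m):
    """m iterations of Eq. (3), u <- H u + P^{-1} f, starting from u = 0."""
    u = [0.0] * len(f)
    g = matvec(P_inv, f)
    for _ in range(m):
        u = [a + b for a, b in zip(matvec(H, u), g)]
    return u


L = poisson_1d(5)
P_inv, H = jacobi_splitting(L, 2.0 / 3.0)
f = [1.0, -2.0, 0.5, 3.0, 1.5]
R = smoother_matrix(P_inv, H, 3)
difference = max(abs(a - b) for a, b in zip(matvec(R, f), sweeps(P_inv, H, f, 3)))
```

The two results agree to rounding error, which is the sense in which $R$ is "the matrix of the relaxation calculation".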


4.1 The two-grid preconditioner with pre-smoothing only 


First consider the case with no post-smoothing. The matrix of one iteration of Table 1 equals 

$$
\begin{aligned}
M &= (I - p L_{l-1}^{-1} r L_l) R + p L_{l-1}^{-1} r \\
  &= R + p L_{l-1}^{-1} r (I - L_l R).
\end{aligned}
\tag{5}
$$

Then the following theorem holds. 

Theorem 1 Suppose that the matrix $L_{l-1}$ is symmetric and positive definite, and let $N = I - L_l R$. 
If the matrices $N$ and $P$ are symmetric, the matrix $M$ of Eq. (5) is symmetric in the $N$-energy 
inner product. If the matrix $N$ is symmetric and nonsingular, the matrix $P$ is symmetric and $m$ is 
even, then the matrix $M$ is positive definite in the $N$-energy inner product, where the $N$-energy 
inner product is defined by $(x, y)_N = (x, N y)$. 


Proof. Since $N$ is symmetric, $(I - L_l R)^T = I - L_l R$. Therefore 

$$I - R^T L_l = I - L_l R.$$

Since $P$ is symmetric, the matrix $R$ is also symmetric. Then 

$$R L_l = L_l R. \tag{6}$$

And 

$$
\begin{aligned}
(x, My)_N &= x^T N R y + x^T N p L_{l-1}^{-1} r (I - L_l R) y \\
          &= x^T (I - L_l R) R y + x^T (I - L_l R) p L_{l-1}^{-1} r (I - L_l R) y.
\end{aligned}
\tag{7}
$$

Besides, 

$$
\begin{aligned}
(Mx, y)_N &= x^T M^T N y \\
          &= x^T R N y + x^T (I - L_l R) p L_{l-1}^{-1} r N y \\
          &= x^T (I - L_l R) R y + x^T (I - L_l R) p L_{l-1}^{-1} r (I - L_l R) y
             \quad \text{(because of Eq. (6))} \\
          &= (x, My)_N.
\end{aligned}
\tag{8}
$$

Therefore the matrix $M$ is symmetric in the $N$-energy inner product. 

Next, it is shown that the matrix $M$ is positive definite in the $N$-energy inner product, that is, 
$(x, Mx)_N > 0$. First, 

$$
\begin{aligned}
N &= I - L_l R \\
  &= I - R L_l \\
  &= I - \sum_{i=0}^{m-1} H^i P^{-1} L_l \\
  &= (P^{-1} Q)^m \\
  &= H^m.
\end{aligned}
$$

Thus 

$$
\begin{aligned}
NM &= (I - L_l R) \{ R + p L_{l-1}^{-1} r (I - L_l R) \} \\
   &= H^m R + H^m p L_{l-1}^{-1} r H^m.
\end{aligned}
$$

Since $P$ is symmetric and nonsingular and $L_l$ is symmetric and positive definite, 
$H = P^{-1} Q = I - P^{-1} L_l$ has real eigenvalues. Hence if $m$ is even, $H^m$ is positive definite. If $P + Q$ 
is positive definite and $m$ is even, then $R$ is positive definite (see [11]). Therefore $H^m R$ is positive 
definite. Since $H^m$ is symmetric and $p L_{l-1}^{-1} r$ is positive semidefinite, $H^m p L_{l-1}^{-1} r H^m$ is positive 
semidefinite. Thus $NM$ is positive definite. □ 

An iterative method that satisfies the assumptions of Theorem 1 is the damped Jacobi method. 
By this theorem, the two-grid preconditioner with the damped Jacobi method as the relaxation 
calculation fulfills the conditions of a preconditioner of the CG method that uses the $N$-energy 
inner product instead of the usual inner product. 

4.2 The two-grid preconditioner with both pre-smoothing and post-smoothing 

Next consider the two-grid iteration with both pre-smoothing and post-smoothing. Suppose the 
pre-smoothing and the post-smoothing are the same method. Then the matrix of one two-grid 
iteration in Table 1 equals 

$$
\begin{aligned}
M &= H^m \{ (I - p L_{l-1}^{-1} r L_l) R + p L_{l-1}^{-1} r \} + R \\
  &= H^m R + R + H^m p L_{l-1}^{-1} r (I - L_l R).
\end{aligned}
\tag{9}
$$

However, since $P$ and $Q$ are symmetric, 

$$I - L_l R = (Q P^{-1})^m = (H^T)^m.$$

Therefore the matrix $M$ of Eq. (9) is rewritten as 

$$M = H^m R + R + H^m p L_{l-1}^{-1} r (H^T)^m. \tag{10}$$

Then the following theorem holds. 

Theorem 2 Suppose that the matrix $L_{l-1}$ is symmetric and positive definite. If the matrix $P$ is 
symmetric, the matrix $M$ of Eq. (10) is symmetric and positive definite. 

Proof. Since the matrix $P$ is symmetric, the matrix $R$ is also symmetric. Thus 

$$M^T = R (H^T)^m + R + H^m p L_{l-1}^{-1} r (H^T)^m.$$

Now 

$$H^m R = H^m \sum_{i=0}^{m-1} H^i P^{-1}, \qquad R (H^T)^m = \sum_{i=0}^{m-1} H^i P^{-1} (H^T)^m.$$

Moreover, since $P$ is symmetric and $H = P^{-1} Q$, we have $P^{-1} H^T = H P^{-1}$. Therefore 

$$H^m R = R (H^T)^m.$$

Consequently the matrix $M$ is symmetric. Next it is shown that the matrix $M$ is positive definite. 

$$
\begin{aligned}
M &= H^m R + R + H^m p L_{l-1}^{-1} r (H^T)^m \\
  &= \sum_{i=0}^{2m-1} H^i P^{-1} + H^m p L_{l-1}^{-1} r (H^T)^m.
\end{aligned}
\tag{11}
$$

Since the first term $\sum_{i=0}^{2m-1} H^i P^{-1}$ on the right-hand side of Eq. (11) is the matrix obtained after 
$2m$ iterations, it is positive definite if $P + Q$ is positive definite. Since $L_{l-1}^{-1}$ is positive definite, 
$H^m p L_{l-1}^{-1} r (H^T)^m$ is positive semidefinite. Therefore $M$ is positive definite. □ 

Iterative methods that satisfy the assumption of Theorem 2 include the damped Jacobi method, the 
Red-Black Symmetric Gauss-Seidel method (RB-SGS method), the multi-color SSOR method, the ADI 
method and so on. By this theorem, the two-grid preconditioner with one of these iterative 
methods as the relaxation calculation fulfills the conditions of a preconditioner of the CG method. 
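
Theorem 2 can be illustrated numerically by assembling the two-grid matrix $M$ column by column and testing it. The sketch below applies the iteration of Table 1 (with both smoothing steps, starting from $u = 0$) to each unit vector. The one-dimensional Poisson operators, damped Jacobi with factor 2/3, linear interpolation with full weighting, and all function names are assumptions chosen for this example, not the paper's test setup.

```python
def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def poisson_1d(n, h):
    """Tridiagonal (-1, 2, -1) / h^2 matrix for -u'' with Dirichlet ends."""
    return [[(2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0) / h ** 2
             for j in range(n)] for i in range(n)]


def solve(A, b):
    """Gaussian elimination with partial pivoting (small dense systems only)."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda k: abs(aug[k][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for k in range(c + 1, n):
            factor = aug[k][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[k][j] -= factor * aug[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def prolong(vc):
    vf = [0.0] * (2 * len(vc) + 1)
    for j, v in enumerate(vc):
        vf[2 * j] += 0.5 * v
        vf[2 * j + 1] += v
        vf[2 * j + 2] += 0.5 * v
    return vf


def restrict(vf):                      # r = (1/2) p^T, i.e. full weighting
    nc = (len(vf) - 1) // 2
    return [0.25 * vf[2 * j] + 0.5 * vf[2 * j + 1] + 0.25 * vf[2 * j + 2]
            for j in range(nc)]


def smooth(L, u, f, m, omega=2.0 / 3.0):
    """m damped Jacobi sweeps; equivalent to u <- H^m u + R f."""
    for _ in range(m):
        Lu = matvec(L, u)
        u = [ui + omega * (fi - li) / L[i][i]
             for i, (ui, fi, li) in enumerate(zip(u, f, Lu))]
    return u


def two_grid(Lf, Lc, f, m):
    u = smooth(Lf, [0.0] * len(f), f, m)                          # pre-smoothing
    d = restrict([li - fi for li, fi in zip(matvec(Lf, u), f)])   # d = r (L u - f)
    v = solve(Lc, d)                                              # coarse solve
    u = [ui - pi for ui, pi in zip(u, prolong(v))]                # u = u - p v
    return smooth(Lf, u, f, m)                                    # post-smoothing


def two_grid_matrix(nc, m):
    nf = 2 * nc + 1
    h = 1.0 / (nf + 1)
    Lf, Lc = poisson_1d(nf, h), poisson_1d(nc, 2.0 * h)
    cols = [two_grid(Lf, Lc, [1.0 if i == j else 0.0 for i in range(nf)], m)
            for j in range(nf)]
    return [list(row) for row in zip(*cols)]


def is_symmetric(M, tol=1e-12):
    return all(abs(M[i][j] - M[j][i]) <= tol
               for i in range(len(M)) for j in range(i))


def is_positive_definite(M):
    """Attempt a Cholesky factorization; it succeeds only for positive definite M."""
    n = len(M)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = M[i][j] - sum(C[i][k] * C[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False
                C[i][i] = s ** 0.5
            else:
                C[i][j] = s / C[j][j]
    return True


M = two_grid_matrix(3, 1)
```

For this small example `is_symmetric(M)` and `is_positive_definite(M)` both hold, as the theorem predicts; a single example is of course an illustration, not a proof.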

5 THE MULTIGRID PRECONDITIONER 


In the previous section, two kinds of two-grid preconditioners were considered. In the 
following, only the latter two-grid preconditioner, with both pre-smoothing and post-smoothing, 
is discussed; the same discussion can be applied to the former. This section extends the result 
to the multigrid preconditioner. The following theorem holds. 

Theorem 3 If the assumptions of Theorems 1 and 2 are satisfied, all MG($m$, $n$) methods ($m, n \ge 1$) 
satisfy the conditions of a preconditioner of the CG method, where $m$ is the multigrid cycle and $n$ 
is the number of iterations of the smoothing method. 


Proof. The matrix $M_l$ of the V-cycle multigrid method can be defined as 

$$
\begin{aligned}
M_0 &= L_0^{-1} \text{ or } R_0, \\
M_l &= H^m R_l + R_l + H^m p M_{l-1} r (H^T)^m \quad (l \ge 1).
\end{aligned}
$$

$M_0$ is symmetric and positive definite. If $M_l$ is symmetric and positive definite, $M_{l+1}$ is also 
symmetric and positive definite because of Theorem 2. By mathematical induction, every 
$M_l$ ($l \ge 0$) is symmetric and positive definite. Therefore the V-cycle multigrid method can be used 
as a preconditioner. 

Next the W-cycle multigrid method is considered. Let $N_l^{(n)}$ be the matrix of the multigrid method 
with $n$ recursive calls of the multigrid method on level number $l-1$ as the solution on the coarse 
grid. It is defined as 

$$
\begin{aligned}
N_0^{(n)} &= L_0^{-1} \text{ or } R_0, \\
N_l^{(n)} &= \sum_{i=0}^{n-1} H_{mg}^i \{ H^m R_l + R_l + H^m p N_{l-1}^{(n)} r (H^T)^m \} \quad (l \ge 1),
\end{aligned}
$$

where $H_{mg} = H^{2m} - H^m p N_{l-1}^{(n)} r L_l H^m$. $N_0^{(n)}$ is symmetric and positive definite. If $N_{l-1}^{(n)}$ is symmetric 
and positive definite, $H^m R_l + R_l + H^m p N_{l-1}^{(n)} r (H^T)^m$ is symmetric and positive definite by 
Theorem 2. Thus $N_l^{(n)}$ is also symmetric. And because $\rho(H_{mg}) < 1$ by [?], $N_l^{(n)}$ is positive 
definite. The W-cycle multigrid method is the case of $n = 2$. Therefore the W-cycle multigrid 
method and all MG($m$, $n$) ($m, n \ge 1$) satisfy the conditions of the preconditioner. □ 

6 THE MGCG METHOD 

In the previous sections, multigrid preconditioners that are valid as preconditioners of the 
CG method were considered. When only pre-smoothing is performed, the multigrid preconditioner 
with an even number of iterations of damped Jacobi smoothing can serve as a preconditioner of 
the conjugate gradient method with the $N$-energy inner product instead of the usual inner product. 

When both pre-smoothing and post-smoothing are performed, the multigrid preconditioner with 
RB-SSOR smoothing, the ADI method and so on fulfills the requirements of a preconditioner of the 
conjugate gradient method. Thus the multigrid preconditioned conjugate gradient method (MGCG 
method) is mathematically valid. There are several variations of this preconditioner. If $m$ is the 
cycle of the multigrid method, $I$ is the relaxation method, $n$ is the number of iterations of the 
relaxation method and $g$ is the number of grids, the MGCG method is specified as MGCG($I$, $m$, $n$, $g$). 
The parameter $g$ is optional; if it is omitted, all available grids are used. For example, 
MGCG(RB, 1, 2) is the MGCG method with the V-cycle multigrid preconditioner and two iterations 
of Red-Black SSOR smoothing. 
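
To show how the pieces fit together, here is a minimal matrix-free Python sketch of an MGCG iteration for the one-dimensional model problem $-u'' = f$ with homogeneous Dirichlet conditions. It is an illustration only: it uses damped Jacobi smoothing (one of the smoothers covered by Theorem 2) rather than the Red-Black SSOR smoother of MGCG(RB, 1, 2), and the problem, grid size, tolerances and function names are assumptions, not the paper's C++ code.

```python
def apply_L(u, h):
    """Apply the 1D Poisson matrix tridiag(-1, 2, -1) / h^2 (Dirichlet ends)."""
    n = len(u)
    return [(2.0 * u[i] - (u[i - 1] if i > 0 else 0.0)
             - (u[i + 1] if i < n - 1 else 0.0)) / h ** 2 for i in range(n)]


def smooth(u, f, h, sweeps, omega=2.0 / 3.0):
    """Damped Jacobi: u <- u + omega * D^{-1} (f - L u), with D = 2 / h^2."""
    for _ in range(sweeps):
        u = [ui + omega * 0.5 * h * h * (fi - li)
             for ui, fi, li in zip(u, f, apply_L(u, h))]
    return u


def restrict(vf):                       # full weighting, r = (1/2) p^T
    nc = (len(vf) - 1) // 2
    return [0.25 * vf[2 * j] + 0.5 * vf[2 * j + 1] + 0.25 * vf[2 * j + 2]
            for j in range(nc)]


def prolong(vc):                        # linear interpolation
    vf = [0.0] * (2 * len(vc) + 1)
    for j, v in enumerate(vc):
        vf[2 * j] += 0.5 * v
        vf[2 * j + 1] += v
        vf[2 * j + 2] += 0.5 * v
    return vf


def v_cycle(f, h, sweeps):
    """One V-cycle from a zero initial guess; returns M f with M approximating L^{-1}."""
    if len(f) == 1:
        return [0.5 * h * h * f[0]]                                # exact coarsest solve
    u = smooth([0.0] * len(f), f, h, sweeps)                       # pre-smoothing
    d = restrict([li - fi for li, fi in zip(apply_L(u, h), f)])    # d = r (L u - f)
    v = v_cycle(d, 2.0 * h, sweeps)                                # coarse grid correction
    u = [ui - pi for ui, pi in zip(u, prolong(v))]                 # u = u - p v
    return smooth(u, f, h, sweeps)                                 # post-smoothing


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mgcg(f, h, sweeps=2, tol=1e-8, max_iter=100):
    """PCG (Program 1) with the V-cycle above as the preconditioner M."""
    x = [0.0] * len(f)
    r = list(f)
    rt = v_cycle(r, h, sweeps)
    p = list(rt)
    rho = dot(rt, r)
    stop = tol * dot(f, f) ** 0.5
    for i in range(max_iter):
        if dot(r, r) ** 0.5 <= stop:
            return x, i
        Ap = apply_L(p, h)
        alpha = rho / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, Ap)]
        rt = v_cycle(r, h, sweeps)                                 # preconditioning
        rho_new = dot(rt, r)
        p = [ti + (rho_new / rho) * pi for ti, pi in zip(rt, p)]
        rho = rho_new
    return x, max_iter


n = 63                                  # number of interior points, must be 2^k - 1
h = 1.0 / (n + 1)
x, iters = mgcg([1.0] * n, h)
```

The pre- and post-smoothing steps are identical and the restriction is a scaled transpose of the prolongation, which is what keeps the V-cycle symmetric and hence usable inside CG. For $f = 1$ the discrete solution coincides with $u(x) = x(1 - x)/2$ at the grid points, which gives a simple correctness check.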

7 NUMERICAL EXPERIMENTS 

7.1 Problems 

A two-dimensional Poisson equation with a Dirichlet boundary condition, 

$$-\nabla \cdot (k \nabla u) = f \quad \text{in } \Omega = [0, 1] \times [0, 1], \qquad u = g \quad \text{on } \partial\Omega,$$

where $k$ is a real function, is considered. The equation is defined by a diffusion constant $k$, a 
source term $f$ and a boundary condition $g$. Numerical experiments are performed under the 
following two conditions. 

Problem 1 The diffusion constant is uniform and the source term is equal to 0. The boundary 
condition is $g = 0$ except on $y = 1$, where $g = 3x(1 - x)$. 

Problem 2 The diffusion constant and the source term are depicted in Figs. 1 and 2. The boundary 
condition $g$ is always equal to 0. 

Figure 1. Diffusion constant of problem 2 



Figure 2. Source term of problem 2 


Problem 1 is a simple case, for which the multigrid method is expected to converge efficiently. The 
multigrid preconditioner is also expected to be efficient. Problem 2 has a non-uniform diffusion 
constant, and the area with a large diffusion constant looks like the letter 'T'; the problem 
matrix therefore has a rich distribution of eigenvalues, which is investigated in the next section. 
Moreover, since the source term is complex, specific iterative methods such as the ICCG method 
and the MICCG method cannot converge very rapidly merely by accident. 

These problems are discretized on three meshes, 64 x 64, 128 x 128 and 256 x 256, by 
the finite element method. The resulting coefficient matrices are symmetric, positive definite and 
block tridiagonal. 


7.2 Solutions 


In the numerical experiments, three methods are compared: the MGCG(RB, 1, 2) method, the 
ICCG(1, 2) method and the MG(1, 2) method. The ICCG(1, 2) method is the PCG method 
with an incomplete Cholesky decomposition that has one additional line relative to the sparsity 
pattern of the original problem matrix. The MG(1, 2) method is identical to the multigrid 
preconditioner of the MGCG(RB, 1, 2) method. 

The numerical experiments are performed on an HP9000/720, and the program is written in C++ 
with original vector and matrix classes. 

7.3 Convergence of the MGCG method 


          MGCG(RB,1,2)      MGCG(RB,1,4)      ICCG(1,2)         MG(1,2)
size      iter.    time     iter.    time     iter.    time     iter.    time
63^2          5    0.56         4    0.61        38    1.19         ?    0.65
127^2         5    3.16         5    4.58        72   10.88         ?    4.05
255^2         5    15.8         5    23.7       134    89.5         ?    20.2

(iter. = number of iterations; time in seconds; HP9000/720; C++) 

Table 2. Problem 1 



          MGCG(RB,1,2)      MGCG(RB,1,4)      ICCG(1,2)         MG(1,2)
size      iter.    time     iter.    time     iter.    time     iter.    time
63^2          9    0.98         8    1.19        53    1.65       150    13.4
127^2         9    5.54         8    7.21       103   15.49       135    75.3
255^2         9    27.8         8    37.4       200   133.0       122   341.5

(iter. = number of iterations; time in seconds; HP9000/720; C++) 

Table 3. Problem 2 

Tables 2 and 3 show the results of these numerical experiments. The number of iterations and the 
time until convergence are measured for each method. For the MGCG method and the ICCG method 
the number of iterations is the number of CG iterations; for the multigrid method it is the number 
of V-cycle iterations. From the results of the two problems, the following points are notable: 


• The MGCG method converges in very few iterations. 

• The number of iterations of the MGCG method does not increase as the mesh becomes larger. 

• Even for complex problems, such as problem 2, the MGCG method converges fast. 


The first item is discussed by an eigenvalue analysis in the next section. From the second item, the 
MGCG method is advantageous over the ICCG method however large the mesh is. It is a 
principle of the multigrid method that the number of iterations does not depend upon the mesh 
size. If the problem is simple, such as problem 1, the multigrid method converges very fast; however, 
for complex problems, such as problem 2, it converges very slowly. To avoid this, the multigrid 
method would need a stronger relaxation method, but stronger relaxation methods have poor 
parallelism. Moreover, in problem 2 the locking effect [?] is considered to have occurred. From 
the third item, the MGCG method is also superior to the multigrid method because of its stable, 
fast convergence and high parallelism. 

8 EIGENVALUE ANALYSIS 


In order to study the efficiency of the multigrid preconditioner, the eigenvalue distribution of the 
coefficient matrix after preconditioning is examined. The number of iterations of the conjugate 
gradient method until convergence depends upon the initial vector, the distribution of eigenvalues 
of the coefficient matrix and the right-hand side. Because a good initial vector or a simple 
right-hand side can make the conjugate gradient method converge unrepresentatively fast, the 
eigenvalue distribution itself is investigated. The problem is the same as in Section 7, and the 
domain is discretized on a 16 x 16 mesh by the finite element method. The condition number of this 
coefficient matrix is 5282.6. 
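
For orientation, the standard worst-case error bound for the conjugate gradient method (a textbook result, not derived in this paper) can be evaluated for this condition number. Here $\kappa$ is the condition number, $x^*$ the exact solution and $\|\cdot\|_A$ the energy norm; these symbols are introduced only for this remark.

```latex
\|x_k - x^*\|_A \;\le\; 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^{k} \|x_0 - x^*\|_A ,
\qquad
\kappa = 5282.6 \;\Rightarrow\; \sqrt{\kappa} \approx 72.7 , \quad
\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \approx 0.973 .
```

Without preconditioning the bound therefore guarantees only about a 2.7 percent error reduction per iteration. The bound depends on the extreme eigenvalues alone; when most eigenvalues are clustered, the conjugate gradient method typically converges much faster than this estimate suggests, which is why the full distribution is examined.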

The matrix after multigrid preconditioning is calculated as follows. The matrix $M$ of Eq. (5) 
or (10) is Cholesky decomposed as $M = U^T U$; then the eigenvalues of the matrix $U L_l U^T$ are 
investigated. The matrix for the ICCG method, on the other hand, is calculated as follows. The 
matrix $L_l$ is incomplete Cholesky decomposed as $L_l = S^T S - T$, and the generalized eigenvalue 
problem $L_l x = \lambda S^T S x$ is solved in order to examine the eigenvalues after preconditioning. 



Figure 3. Eigenvalue distribution of a problem 
matrix 



Figure 4. Eigenvalue distribution after 
preconditioning 


The eigenvalue distribution of the problem matrix is shown in Fig. ??. The horizontal axis is 
the order of the eigenvalues and the vertical axis gives the eigenvalues on a log scale. The 
eigenvalue distribution of the matrix after preconditioning is shown in Fig. ??, where the 
vertical axis is on a linear scale. For comparison, preconditioning is carried out with both the 
multigrid method and the incomplete Cholesky decomposition. 

The eigenvalue distribution produced by the multigrid preconditioner is favorable for the 
conjugate gradient method in the following respects: 

1. Almost all eigenvalues are clustered around 1 and a few eigenvalues are scattered between 0 
and 1. 

2. The smallest eigenvalue is larger than that of the ICCG method. 

3. The condition number is decreased. 


The scattered eigenvalues in the first item pose no problem for the conjugate gradient method. All 
of these characteristics are desirable for accelerating the convergence of the conjugate gradient 
method. In problem 1 there are no scattered eigenvalues, so the multigrid method converges 
efficiently; in problem 2, however, the scattered eigenvalues prevent the ordinary multigrid method 
from converging rapidly. Therefore using the multigrid method as a preconditioner of the conjugate 
gradient method is quite important. 

9 CONCLUSION 

This paper investigates the conjugate gradient method with a multigrid preconditioner (MGCG 
method). A preconditioning matrix of the conjugate gradient method must be symmetric and 
positive definite. First, two kinds of two-grid preconditioners are considered, and conditions 
under which both satisfy these requirements are given. Second, the extension to the multigrid 
preconditioner is carried out, and conditions for a valid multigrid preconditioner are also given. 
Third, numerical experiments show that the MGCG method converges faster and is more effective 
than both the ICCG method and the multigrid method. Finally, eigenvalue analysis is performed in 
order to verify the effect of the multigrid preconditioner. It is concluded that the multigrid 
preconditioner is an excellent preconditioner and that it reduces the number of CG iterations 
remarkably. Consequently the MGCG method has the following properties: 

• The number of iterations does not increase even when the mesh is finer. 

• Even when the problem is ill-conditioned, the MGCG method is effective. 

• The distribution of the eigenvalues of the matrix after preconditioning is suited to the 
conjugate gradient method. 

• The MGCG method has high parallelism. 

The multigrid method roughly solves any problem, since almost all eigenvalues in Section ?? are 
clustered around unity, but a few scattered eigenvalues prevent fast convergence. The conjugate 
gradient method compensates for this defect of the multigrid method. Therefore the MGCG method 
becomes an efficient method. Parallelization of the MGCG method and its implementation on 
multicomputers are beyond the scope of this paper and are not discussed further. However, 
since the MGCG method has high parallelism and fast convergence, it is a very promising 
method for the solution of large, sparse, symmetric and positive definite systems. 


SUMMARY 


The convergence rate of standard multigrid algorithms degenerates on problems with 
stretched grids or anisotropic operators. The usual cure for this is the use of line or plane 
relaxation. However, multigrid algorithms based on line and plane relaxation have limited and 
awkward parallelism and are quite difficult to map effectively to highly parallel architectures. Newer 
multigrid algorithms that overcome anisotropy through the use of multiple coarse grids rather than 
line relaxation are better suited to massively parallel architectures because they require only simple 
point-relaxation smoothers. 

In this paper, we look at the parallel implementation of a V-cycle multiple semicoarsened grid 
(MSG) algorithm on distributed-memory architectures such as the Intel iPSC/860 and Paragon 
computers. The MSG algorithms provide two levels of parallelism: parallelism within the relaxation 
or interpolation on each grid and across the grids on each multigrid level. Both levels of parallelism 
must be exploited to map these algorithms effectively to parallel architectures. This paper describes 
a mapping of an MSG algorithm to distributed-memory architectures that demonstrates how both 
levels of parallelism can be exploited. The result is a robust and effective multigrid algorithm for 
distributed-memory machines. 


'This research was supported by the National Aeronautics and Space Administration under NASA contract nos. 
NAS1-19480 and NAS1-18605 while the second author was in residence at the Institute for Computer Applications in 
Science and Engineering (ICASE), NASA Langley Research Center, Hampton, VA 23681-0001. 
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INTRODUCTION 


The convergence rate of standard multigrid algorithms degenerates on problems that have 
anisotropic discrete operators. Such operators arise when the continuous operator is anisotropic or 
when the discretization is based on highly stretched grids. Although a number of effective cures 
exist for this difficulty, the best sequential algorithms (based on line or plane relaxation) do not 
appear to be viable on emerging, massively parallel architectures. Thus, newer algorithms, which 
achieve robustness through the use of multiple coarse grids rather than line or plane relaxation and 
require only point-relaxation smoothers, are an attractive alternative. 

The problems with line- and plane-relaxation algorithms on parallel architectures have only 
recently become apparent. Although the tridiagonal systems involved can be solved in parallel by 
substructured elimination, for example, this approach approximately doubles their computational 
cost. In addition, a more subtle difficulty exists. The fastest robust sequential algorithms combine 
line- and plane-relaxation algorithms with semicoarsening. Unfortunately, this means that the size 
of the line and plane solutions required on coarse grids is the same as on the fine grid. For example, 
an $n^2$-point grid in two dimensions with a parallel tridiagonal solver and $O(n^2)$ processors gives a
theoretical upper bound on parallel efficiency of only $O(1/\log^2 n)$. Thus, the fact that parallel
implementations of such algorithms have proven problematic is not surprising (refs. 1, 2, 3).

An alternate approach to robustness, based on using multiple grids on every coarse multigrid 
level, is newer and relatively untried. Through the use of appropriate coarse grids, one can obtain 
point-relaxation algorithms as robust as line- and plane-relaxation algorithms (refs. 4, 5, 6, 7). 
However, because of the large number of coarse grids required, these algorithms are not quite 
competitive with line- and plane-relaxation algorithms on sequential machines. On parallel 
architectures, the opposite is true (refs. 5,8,9) because the increased parallelism due to the multiple 
coarse grids is an attractive bonus. In particular, Douglas’ method is robust and can be mapped 
effectively to parallel architectures (ref. 5); Horton (ref. 9) has looked recently at the mapping of 
Hackbusch’s Frequency Decomposition method (ref. 6) to parallel architectures. 

In this paper, we study the mapping of the multiple semicoarsened grid (MSG) algorithm, a 
variant of Mulder’s multiple coarse-grid algorithm (ref. 10), to highly parallel architectures. The 
MSG algorithm (ref. 7) is relatively robust and at the same time provides ample parallelism for 
current parallel architectures. We take as our model problem the symmetric, positive-definite 
Helmholtz equation 

$$a u_{xx} + b u_{yy} + c u_{zz} - d u = f$$

with $a, b, c, d > 0$ and focus on the mapping issues involved in implementing this algorithm on
distributed-memory architectures such as the Intel iPSC/860 and Paragon. 

This paper is organized as follows. We begin with a description of the MSG algorithm in the 
next section, which is followed by a discussion of observed convergence rates. Our parallel 
implementation is then described. We present the experimental results, and, finally, conclusions are 
given. 


ALGORITHM DESIGN 


We first need to describe the MSG algorithm. For notational simplicity, assume that the
domain of the model problem is the unit square in two dimensions and that this problem is to be
solved on an $n \times n$ uniform grid

$$\Omega^h = \{(ih, jh) \mid i = 0, 1, \ldots, n;\ j = 0, 1, \ldots, n\}$$

with $h = 1/n$. Define the coarser grids $\Omega^{l,m}$, which are obtained by successive semicoarsening of $\Omega^h$
$l$ times in the x-direction and $m$ times in the y-direction. Thus, $\Omega^{l,m}$ has $(n + 1)/2^l$ grid points in
the x-direction and $(n + 1)/2^m$ grid points in the y-direction.

Notice that the notation does not distinguish between a grid obtained by semicoarsening first
in the y-direction and then in the x-direction and a grid obtained by semicoarsening first in the
x-direction and then in the y-direction. Either path leads to a grid of the same shape and size. As
shown by Mulder (ref. 10), such equivalent grids must be combined in order to construct reasonable
algorithms in three or more dimensions.

Figure 1 shows the interrelations between the various grids for a two-dimensional problem
with an 8 x 8 fine grid. With coarse grids combined as in this diagram, for a 16 x 16 problem one
would have only 16 grids altogether; without combining, the full binary tree of grids would contain
69 grids.
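
The two grid counts quoted above can be checked with a short calculation. The sketch below is illustrative only and rests on one assumption that is inferred rather than stated in the text: the 16 x 16 problem allows up to three semicoarsenings in each direction (which gives the 16 combined grids), and the 8 x 8 problem allows two (nine grids on five levels, matching Figure 1). The function name `count_grids` is hypothetical.

```python
from math import comb

def count_grids(k):
    """Count grids when each direction can be semicoarsened up to k times.

    Returns (combined, tree): `combined` merges grids reached by different
    coarsening orders; `tree` counts one grid per coarsening path.
    """
    # One combined grid for every pair (l, m) with 0 <= l, m <= k.
    combined = (k + 1) ** 2
    # comb(l + m, l) is the number of distinct orders in which l
    # x-coarsenings and m y-coarsenings can be interleaved.
    tree = sum(comb(l + m, l) for l in range(k + 1) for m in range(k + 1))
    return combined, tree

print(count_grids(3))  # 16 x 16 case under the assumption above
```

Under this assumption the 16 x 16 case reproduces both figures from the text (16 combined grids, 69 grids in the full tree), which suggests the count is taken over all coarsening paths.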



Figure 1. Semicoarsening of an 8 x 8 grid (diagram of grid levels 1 through 5 not reproduced).

Given this family of grids, one can construct a V-cycle correction scheme analogous to the 
standard full-coarsening multigrid algorithm. One-dimensional linear interpolation provides a
natural prolongation operator; its adjoint gives the “full weighting” restriction operator. These
choices, together with any reasonable smoother, yield a multigrid algorithm. However, the resulting
algorithm is not robust. 

The problem with this simple correction scheme is as follows. If the prolongation is scaled so
that the full correction is obtained from the modes that are oscillatory in x but not y and 
conversely, then the result is double the required correction of the smoothest components that 
belong to both coarse grids, and divergence results. On the other hand, if the prolongation is scaled 
to get the proper correction of the smooth components, then some of the oscillatory components are 
undercorrected, and robustness is lost. 

The resolution of this problem is to filter either the residuals that are being restricted or the 
corrections that are being prolonged to achieve a convergent V-cycle for the model problem 

$$a u_{xx} + b u_{yy} = f$$

where the convergence rate is independent of $a, b > 0$. This filtering can be performed in several
ways. 


Let $v^{l,m}$ denote the correction on grid $\Omega^{l,m}$. Also let $R_x$ and $R_y$ denote restriction in the x- and
y-directions, and, similarly, let $P_x$ and $P_y$ denote prolongations. The first effective solution to
this problem was given by Mulder (ref. 10). Mulder forms the fine-grid correction

$$P_x v^{1,0} + P_x R_x P_y v^{0,1}$$

given solutions $v^{1,0}$ and $v^{0,1}$ on the second level and similar solutions for coarser levels. One can
think of the operator $P_x R_x$ here as a high-pass filter that filters out the excess correction for the
smooth modes common to both coarse grids.

In recent work, Naik and Van Rosendale have been looking at the analogous scheme with the 
correction 

$$\left(1 + \tfrac{1}{2} P_y R_y\right) P_x v^{1,0} + \left(1 + \tfrac{1}{2} P_x R_x\right) P_y v^{0,1}$$

which can be thought of as a symmetric version of Mulder’s scheme. A V-cycle proof for one variant 
of this scheme appears to be possible. 

A third way of making the correction is to compute a scalar-valued function $\alpha(x, y)$, which
depends on the strength of the discrete differential operator in each coordinate direction. Then,
with a properly chosen $\alpha$, one uses the correction

$$\alpha(x, y) P_x v^{1,0} + [1 - \alpha(x, y)] P_y v^{0,1}$$

A V-cycle convergence proof for this scheme, at least for constant coefficient problems, was given in
ref. 7. This reference also provides details on the computation of $\alpha(x, y)$.
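
The difference between simply adding the two prolonged corrections and blending them with the switch can be illustrated pointwise. The following is a minimal sketch, not the implementation used in this work: `cx` and `cy` stand for the already prolonged corrections $P_x v^{1,0}$ and $P_y v^{0,1}$ flattened to lists, and the sample values of the switch are hypothetical (the actual computation of the switch is given in ref. 7).

```python
def plain_sum(cx, cy):
    """Add the two prolonged coarse-grid corrections pointwise."""
    return [a + b for a, b in zip(cx, cy)]

def alpha_blend(cx, cy, alpha):
    """Weight the corrections by alpha and 1 - alpha at each point."""
    return [al * a + (1.0 - al) * b for al, a, b in zip(alpha, cx, cy)]

# A smooth error component is represented on both coarse grids, so both
# prolonged corrections reproduce it (illustrative values only).
smooth = [0.5, 1.0, 1.5, 1.0]
alpha = [0.9, 0.5, 0.1, 0.3]  # hypothetical switch values in [0, 1]

doubled = plain_sum(smooth, smooth)           # twice the required correction
blended = alpha_blend(smooth, smooth, alpha)  # the required correction
print(doubled)
print(blended)
```

Because the two weights sum to one at every point, a component common to both coarse grids is corrected exactly once regardless of the value of the switch, whereas the plain sum doubles it.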

On sequential machines, any of these schemes is effective and robust. Mulder’s scheme and its
symmetrized version eliminate the necessity of choosing $\alpha$; the extra work involved in their
interpolations is trivial. However, because the communication required for interpolation is awkward
and expensive on parallel architectures, we used the alpha-switch algorithm here, which reduces the
complexity of the interpolations. It is as robust as the alternatives and simpler to implement.

Generalization of this alpha-switch algorithm to three dimensions is straightforward.
Instead of simply computing $\alpha(x, y)$, one computes $\alpha(x, y, z)$ and $\beta(x, y, z)$ and then uses the three
weights

$$\alpha(x, y, z), \qquad \beta(x, y, z), \qquad 1 - \alpha(x, y, z) - \beta(x, y, z)$$

From the point of view of parallel architectures, computation of the switching factors $\alpha$ and $\beta$ is
analogous to a Jacobi sweep, which needs to be done only once at the beginning of the computation.

OBSERVED CONVERGENCE RATES 


Experimentally, the MSG algorithm converges extremely well for the model problem 

$$a u_{xx} + b u_{yy} + c u_{zz} - d u = f$$

where the convergence rate is independent of $a, b, c, d > 0$ and of the uniform mesh size. Alternatively,
MSG can be used for stretched grids, as shown in Table 1. The results given are observed
convergence rates for Poisson’s equation with Dirichlet boundary conditions and a random initial
guess. Slow variation in the coefficients $a, b, c$ or in mesh spacing has a similar impact on
convergence. The Helmholtz term $d > 0$ can improve convergence on coarse grids, but is largely
irrelevant. All of the above information applies only to problems with smooth coefficients. Special
algorithms are required for problems with severe coefficient jumps (refs. 11, 3). The discretization
used throughout our experiments was a symmetric seven-point finite-difference stencil, with the 
smoothing done by three red-black successive over-relaxation (SOR) sweeps on every grid. 

The problem with this algorithm on sequential machines is the large number of grids required 
and the resulting high cost per V-cycle. With the usual coarsening by a factor of 2 (as shown in 
Table 1), the total storage for all grids in three dimensions is eight times that of the finest grid. 
Thus, the work per V-cycle is also eight times the work on the finest grid, which does not include 
the cost of the interpolations. 

A more attractive sequential algorithm can be made by changing the coarsening factor. In any 
semicoarsening algorithm, one has fewer Fourier modes to reduce than in full-coarsening algorithms; 
thus, one can afford to coarsen the grids faster. 

If we use coarsening by a factor of 4, for example (see footnote 2), then the total storage becomes

$$(1 + 1/4 + 1/16 + \cdots)^3 = 64/27$$

times that on the finest grid. Thus, the total work is about 64/27, or roughly 2.4, times that on the finest grid.
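
The storage factors for both coarsening ratios follow from the same geometric series. As a small sketch (the function name `storage_factor` is hypothetical, and the series is summed to infinity as in the text, which slightly overestimates a finite grid hierarchy):

```python
from fractions import Fraction

def storage_factor(r, dims=3):
    """Total storage of all semicoarsened grids relative to the finest grid.

    Each direction contributes the geometric series 1 + 1/r + 1/r**2 + ...,
    whose sum is r / (r - 1); the contributions of the directions multiply.
    """
    per_direction = Fraction(r, r - 1)
    return per_direction ** dims

print(storage_factor(2))                            # factor-of-2 coarsening
print(storage_factor(4), float(storage_factor(4)))  # factor-of-4 coarsening
```

Exact rational arithmetic is used so that the result can be compared directly with the fractions quoted in the text (8 and 64/27).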

2 The red-black SOR smoother used yields poor convergence rates for odd coarsening factors. Thus, the reasonable
choices for the coarsening factor are 2 and 4 because either 6 or 8 would make the space of “oscillatory” functions
(which must be effectively reduced by the smoother) too large.

Table 1. Convergence Rates of MSG on Various Grids With Factor-of-2 Coarsening

                             8 x 8 x 8    16 x 16 x 16    32 x 32 x 32
Uniform Grids
  dx = 1000, dy = dz = 1       0.04           0.06            0.07
  dx = 10, dy = dz = 1         0.04           0.06            0.08
  dx = 0.1, dy = dz = 1        0.02           0.05            0.07
  dx = 0.001, dy = dz = 1      0.03           0.07            0.08
Chebyshev Grids
  Chebyshev in x               0.04           0.06            0.11
  Chebyshev in x, y            0.04           0.04            0.12
  Chebyshev in x, y, z         0.03           0.04            0.15


Table 2 gives the observed convergence rates for the same problems as in Table 1; however,
factor-of-4 coarsening was used. Although the convergence rates in Table 2 are poorer than in Table
1, the reduced computational cost per V-cycle more than compensates for this. Three V-cycles of
the algorithm can be accomplished with factor-of-4 coarsening for less than the cost of one V-cycle
with a factor-of-2 coarsening. With the $32^3$ grid, because $0.3^3 = 0.027$, the three V-cycles with a
factor-of-4 coarsening are more effective than one V-cycle with a factor-of-2 coarsening.
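
The comparison above can be written out as a short calculation. This is a sketch under stated assumptions: the relative V-cycle costs are taken as the storage factors 8 and 64/27 (interpolation costs are ignored), the rate 0.3 is the conservative bound used in the text for the factor-of-4 entries, and 0.07 is the smallest $32^3$ entry for factor-of-2 coarsening in Table 1.

```python
from fractions import Fraction

cost_f2 = Fraction(2, 1) ** 3   # relative cost of one factor-of-2 V-cycle
cost_f4 = Fraction(4, 3) ** 3   # relative cost of one factor-of-4 V-cycle

cycles = 3
rate_f4 = 0.3                   # bound on the factor-of-4 rates (Table 2)
best_rate_f2 = 0.07             # smallest 32^3 rate for factor-of-2 (Table 1)

total_cost_f4 = cycles * cost_f4   # cost of three factor-of-4 V-cycles
reduction_f4 = rate_f4 ** cycles   # error reduction after three V-cycles

print(float(total_cost_f4), "<", float(cost_f2))
print(reduction_f4, "<", best_rate_f2)
```

Three factor-of-4 V-cycles cost 64/9 finest-grid work units, which is less than 8, and reduce the error by more than even the best single factor-of-2 V-cycle on the $32^3$ grid.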

Massively parallel architectures that have hundreds or thousands of processors might change 
these considerations and increase the effectiveness of the algorithm with a factor-of-2 coarsening 
because it provides more parallelism on coarse grids. However, because the algorithm with a 
factor-of-4 coarsening seemed to provide ample parallelism and the memory per processor is limited 
on the Intel iPSC/860, we used a factor-of-4 coarsening in our code. 

In addition to the use of a factor-of-2 coarsening, the parallelism can be further increased by 
use of concurrent iteration on all grid levels (refs. 12,13). This form of MSG is particularly 
attractive on SIMD machines, where the mapping strategies needed for the V-cycle algorithm are 
prohibitively complex. In joint research with J. Dendy, this alternative is currently being explored 
for problems with severe coefficient jumps. However, while the concurrent iteration version of MSG 
maps very nicely to SIMD machines (ref. 7), its convergence rate is in the range of 0.5 to 0.6, even
with a factor-of-2 coarsening. Thus, one trades numerical performance for massive parallelism.

Table 2. Convergence Rates of MSG on Various Grids With Factor-of-4 Coarsening

                             8 x 8 x 8    16 x 16 x 16    32 x 32 x 32
Uniform Grids
  dx = 1000, dy = dz = 1       0.21           0.20            0.23
  dx = 10, dy = dz = 1         0.21           0.20            0.24
  dx = 0.1, dy = dz = 1        0.11           0.13            0.18
  dx = 0.001, dy = dz = 1      0.11           0.15            0.14
Chebyshev Grids
  Chebyshev in x               0.19           0.18            0.26
  Chebyshev in x, y            0.11           0.14            0.25
  Chebyshev in x, y, z         0.05           0.19            0.26


MAPPING MSG TO SCALABLE ARCHITECTURES 


The V-cycle MSG algorithm achieves fast convergence and contains substantial parallelism, 
although exploitation of this parallelism is fairly awkward. This awkwardness is in contrast to the 
standard (full-coarsening) multigrid, where parallel implementation is straightforward. For the 
MSG case, we designed a program to compute efficient mappings of the algorithm to a 
distributed-memory architecture. The computed mappings were then implemented with the 
PARTI (see footnote 3) runtime primitives developed at ICASE (refs. 14, 15). Although this implementation was
complex, without PARTI or analogous tools, implementation would have been prohibitively 
difficult. In this section, we describe our implementation strategy. 

Load Balancing 


Two basic issues must be addressed in mapping the V-cycle MSG algorithm to 
distributed-memory architectures: processors must be assigned to the grids on each level and each 
grid must be partitioned across the processors assigned to it. Because a large number of possible 
mapping strategies exist, we made two major simplifying choices. First, we chose to map each 
multigrid level independently of the mapping of all other levels. Second, if the number of processors
was greater than the number of grids on a level, we chose to assign each processor to, at most, one
grid on that level.

3 PARTI is an acronym for Parallel Automated Runtime Toolkit at ICASE.

The first assumption is justified by the observation that the smoothing iteration is more 
frequent and more computationally intensive than the interpolation, so that the achievement of a 
good mapping during the smoothing step is crucial to performance. Also, any mapping that 
achieves an approximate load balance during the smoothing step is bound to induce a large amount 
of communication during interpolation. One reason for this is that the number of grids on each level 
almost always differs from the number on neighboring levels; thus, no mapping exists that 
simultaneously minimizes communication and achieves load balance. 

The second assumption, that each processor is assigned to no more than one grid on every level,
was made to minimize communication, although it does induce some load imbalance. For example,
suppose one has three grids on a level to be split over eight processors. Then each grid would ideally
receive 8/3, or about 2.67, processors. However, such a mapping is complex and clearly increases communication.
Instead, one grid would be assigned to two processors, and the other two grids to three each.

In the current implementation, we did not split processors across grids. Instead, we carefully 
determined those grids that should get fewer and those that should get more processors to achieve 
approximate load balance without splitting processors across grids. In general, long thin grids (grids 
with one array dimension much smaller than the others) induce less communication when split over 
multiple processors than fat grids (grids with all array dimensions about equal). Thus, one 
maximizes load balance by assigning excess processors to the fattest grids. 

Given these preliminaries, our load balancing algorithm follows. Assuming one has p
processors and more processors than grids on all multigrid levels, the algorithm for distributing
processors to grids is

    Assign p processors to the finest grid
    For level := 2 to max_level {
        ngrids := number_of_grids(level)
        assign floor(p/ngrids) processors to each grid
        p_excess := p - ngrids * floor(p/ngrids)
        assign one more processor to each of the p_excess fattest grids
    }

We call this the maximally distributed strategy. 
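
The per-level step of this strategy can be sketched in a few lines. This is an illustration, not the mapping program used in this work: the text does not define how "fatness" is measured, so the ratio of the smallest to the largest array dimension is used here as a hypothetical stand-in, and the grid shapes in the example are made up.

```python
def fatness(shape):
    """Hypothetical fatness measure: smallest over largest array dimension."""
    return min(shape) / max(shape)

def distribute_processors(p, grid_shapes):
    """Return the processor count for each grid on one multigrid level.

    Assumes p >= len(grid_shapes), as in the text.
    """
    ngrids = len(grid_shapes)
    base = p // ngrids               # floor(p / ngrids) for every grid
    p_excess = p - ngrids * base     # processors left over
    counts = [base] * ngrids
    # Grid indices ordered from fattest to thinnest.
    by_fatness = sorted(range(ngrids),
                        key=lambda g: fatness(grid_shapes[g]),
                        reverse=True)
    # One extra processor for each of the p_excess fattest grids.
    for g in by_fatness[:p_excess]:
        counts[g] += 1
    return counts

# Three grids on a level, eight processors (the example from the text).
shapes = [(4, 16, 16), (8, 8, 16), (8, 16, 8)]
print(distribute_processors(8, shapes))
```

With these shapes the long thin grid receives two processors and the two fatter grids receive three each, matching the three-grid, eight-processor example given earlier.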

This algorithm gives a distribution of processors to grids. Afterwards, one still has to partition 
each grid across the processors. To do this, we blocked the finest grid across processors in all three 
directions; coarser grids were blocked in one direction. One reason for this choice is that coarser 
grids often have an odd or prime number of processors, so that partitioning in more than one 
direction can be quite awkward. In all cases, the direction in which the coarser grids were blocked 
was chosen to minimize interprocessor communication. 
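
A one-directional blocking of a coarse grid can be sketched as follows. The text does not state the criterion used to pick the blocking direction, so the rule below is an assumption: cutting along the longest array dimension makes each slab interface (the product of the other two dimensions) as small as possible. The function names are hypothetical.

```python
def choose_block_direction(shape):
    """Pick the axis to cut; the longest axis gives the smallest interface."""
    return max(range(len(shape)), key=lambda axis: shape[axis])

def block_ranges(n, nproc):
    """Split indices 0..n-1 into nproc contiguous, nearly equal blocks."""
    base, extra = divmod(n, nproc)
    ranges = []
    start = 0
    for rank in range(nproc):
        # The first `extra` processors take one additional index.
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

shape = (16, 4, 4)   # a long thin coarse grid (illustrative)
axis = choose_block_direction(shape)
blocks = block_ranges(shape[axis], 3)   # three processors, an odd count
print(axis, blocks)
```

Blocking in a single direction works for any processor count, including odd or prime counts, which is the practical advantage noted above.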

In an alternate implementation referred to as the aligned strategy, all coarse grids were aligned
to the finest grid, which requires each coarse grid to be partitioned among the full set of processors.
Although this strategy will eliminate communication during the interpolation, it leads to increased
communication within a single grid and may quickly lead to idle processors. In the future, a
strategy that uses a combination of the two described above may be implemented. In this hybrid
implementation, coarse grids would be aligned in the first few levels; on lower levels, individual grids
would be assigned to only a subset of processors.

PARTI Implementation 

As stated, the MSG algorithm was implemented in parallel with the multiblock PARTI 
routines. The multiblock library was designed to support block-structured aerodynamics codes in 
which one uses multiple, logically rectangular grid blocks to resolve complex aerodynamic 
geometries (ref. 16). Because the structure of such codes is fairly similar to that of MSG, we found 
that the same routines could be effectively used to implement this algorithm. 

The PARTI library for block-structured codes allows multiple grid blocks to be processed in 
parallel and carries out the necessary communication required to move information among the grids. 
In our parallel implementation that maps coarse grids to subsets of processors, an individual 
“decomposition” is defined for the fine grid and for each coarser grid. In order to have all processors 
active on the finest grid, the fine-grid decomposition is embedded into the entire processor space. 
Then, for each subsequent level, the coarse-grid decompositions are embedded into an 
approximately equal portion of the processor space, as described in the last section. The single 
coarse grid on the coarsest level contains few points so it is mapped to one physical processor. 

Our parallel version reads a file that holds the grid mapping and distribution information. A 
subroutine was created to use this mapping information along with the appropriate PARTI routines 
to set up the problem. As in most multigrid codes, the sequential code uses several large arrays to 
hold the residual, solution, and right-hand-side data for all grids on all levels. Individual grid sizes 
and starting index locations into the large arrays are computed and passed as parameters to 
subroutines. This strategy was maintained in the parallel version; however, the sizes and starting 
locations were modified to reflect the parallelism and the additional space required for holding 
boundary data for those grids distributed over more than one processor. 

While PARTI aims to require minimal changes to the sequential source program, our parallel 
implementation was 20 to 25 percent larger than the original sequential program, and some 
subroutines required an extensive rewrite. Emerging FORTRAN dialects, like High Performance 
FORTRAN, FORTRAN D, and Vienna FORTRAN, may soon ease this programming burden. 
However, the current versions of these languages are not expressive enough to allow mapping 
strategies as complex as those described in this paper. The improvement of such languages, and of 
software tools like PARTI, is an area of active research at ICASE and elsewhere. The present 
situation, in which the effective mapping of an algorithm to a parallel architecture is an arduous 
task of many months, is clearly unacceptable. 
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EXPERIMENTAL RESULTS 


We recently implemented this algorithm and the mapping strategy on a 32-node Intel 
iPSC/860 and will soon migrate this program to a 64-node Intel Paragon and possibly a CM-5. The 
current results are preliminary, but are sufficiently encouraging to suggest the relative efficacy of this 
class of algorithms. For a problem with $16^3$ mesh cells, the achieved efficiencies are given in Table 3.

Table 3. Efficiency of Problem With $16^3$-Point Grid on iPSC/860

Processors     1      2      4      8      16
Efficiency     1.0    .83    .66    .42    .25


Table 4. MSG Performance on the Intel iPSC/860

                                           V-cycle Time (secs)
Size     Nodes    Total Time (secs)    First V-cycle    Subsequent V-cycles
16^3       1           6.96                3.07               1.22
           2           4.21                1.70               .804
           4           2.63                1.05               .508
           8           2.07                .925               .373
          16           1.71                .793               .302
32^3       4           22.6                11.6               3.55
           8           13.5                7.15               2.03
          16           8.39                4.59               1.23
          32           5.27                2.61               .867
64^3      16           49.5                28.8               6.63
          32           24.1                12.1               3.87


These efficiencies were computed relative to the parallel implementation run on one node. A
large amount of overhead can be incurred with the runtime software. For the $16^3$ problem, the
parallel code run on one processor takes approximately four times longer than the sequential code
that contains no PARTI calls. For larger problems, the overhead should become less significant.
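
The efficiencies in Table 3 can be reproduced from the total times reported in Table 4 for the $16^3$ problem. The sketch below assumes that efficiency is defined as the one-node total time divided by the product of the node count and the total time on that many nodes; this definition is inferred from the numbers rather than stated explicitly.

```python
# Total times (secs) for the 16^3 problem, keyed by node count (Table 4).
total_time = {1: 6.96, 2: 4.21, 4: 2.63, 8: 2.07, 16: 1.71}

def efficiency(times, p):
    """Parallel efficiency relative to the one-node run of the parallel code."""
    return times[1] / (p * times[p])

efficiencies = [round(efficiency(total_time, p), 2) for p in sorted(total_time)]
print(efficiencies)
```

Rounded to two digits, the computed values agree with the entries of Table 3.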

Another issue here is the choice of stencil. With the 7-point stencils used, the 
communication/computation ratio is four times greater than for 27-point stencils, and our 
efficiencies are correspondingly lower. However, the PARTI library does not currently update the 
corner ghost points needed for the 27-point stencils, so we were restricted to the use of 7-point 
stencils. This restriction will be changed in the next release of the library.

Figure 2. Execution time versus number of processors (plot not reproduced; horizontal axis: log of the number of processors).


Table 4 shows performance results for several problem sizes. The table contains the overall 
program timings, along with the timings for each V-cycle. The results show the extra time required 
in the first V-cycle for setting up the communication schedules. These schedules are saved and,
therefore, do not need to be recomputed on subsequent iterations.


Figure 2 expands on the data in Table 4. The graph shows that the $32^3$ problem run on 4
nodes requires approximately the same amount of time as the $64^3$ problem run on 32 nodes. This
result is to be expected because the $64^3$ problem has about eight times as much work. In Figure 2, a
horizontal connecting line between the two cases (the dashed line on the graph) would indicate the
achievement of perfect memory-bounded speedup (ref. 17); however, because of various overheads,
this line slopes slightly.
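
The comparison of these two cases can be made concrete with the total times from Table 4. The sketch below uses a simple scaled-efficiency ratio as an illustration; it is an assumption about how one might quantify the slope of the dashed line, not a definition taken from ref. 17.

```python
def scaled_efficiency(t_small, t_large):
    """Ratio of run times when work and processor count grow by the same factor."""
    return t_small / t_large

work_ratio = (64 ** 3) // (32 ** 3)   # growth in the number of grid points
proc_ratio = 32 // 4                  # growth in the number of nodes

# Total times (secs) from Table 4: 32^3 on 4 nodes and 64^3 on 32 nodes.
eff = scaled_efficiency(22.6, 24.1)
print(work_ratio, proc_ratio, round(eff, 3))
```

Both the work and the processor count grow by a factor of eight, and the run time increases only from 22.6 to 24.1 seconds, so the scaled efficiency stays above 0.9.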


The number of cases plotted here was constrained by current limitations of the PARTI library. 
For example, we were unable to obtain any timings on the machine that used more than 32 
processors. Also, because of the large amount of memory consumed by the PARTI communication 
library, the user memory available on each processor decreased. These problems should be resolved 
in future releases of the PARTI library. The multiblock library is in a preliminary stage. We expect
that further optimizations will improve the performance of block-structured codes with the
multiblock library. The performance effects of some optimizations made to the PARTI primitives
used in unstructured codes are described in ref. 18. 


Alternate Mapping Strategies 

We have also experimented with the aligned mapping strategy that was described briefly in
the previous section. With this strategy, the cost of the first V-cycle is much lower than in the
maximally distributed strategy because the communication that occurs in the interpolation is easier
to analyze. However, subsequent V-cycles are more expensive than in the maximally distributed
strategy. This difference seems to be due both to the increased communication within each grid
(because each grid is subdivided more finely) and to the sequentialization of all grids on every level.
As a result, the aligned strategy is less effective than the maximally distributed strategy, even
though it reduces interprocessor communication during the interpolation. In future work, we plan
to study various hybrid strategies like those proposed in ref. 9 that combine the advantages of both
the aligned and maximally distributed strategies.


CONCLUSIONS 


We have examined the parallel implementation of a multigrid algorithm based on multiple 
coarse grids. Such multigrid algorithms have a fast convergence that is independent of grid 
stretching and can be effectively mapped to highly parallel architectures. We have developed a 
strategy for mapping such algorithms to parallel machines and have given preliminary results on the 
effectiveness of this strategy in mapping MSG to the Intel iPSC/860. The PARTI library is being 
ported to the Intel Paragon; we plan to try our algorithms on this larger machine in the near future. 
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Abstract 

In this work the compressible Euler equations are solved using finite volume techniques
on unstructured grids. The spatial discretization employs a central difference
approximation augmented by dissipative terms. Temporal discretization is done using
a multistage Runge-Kutta scheme. A multigrid technique is used to accelerate convergence
to steady state. The coarse grids are derived directly from the given fine grid
through agglomeration of the control volumes. This agglomeration is accomplished 
by using a greedy-type algorithm and is done in such a way that the load, which is 
proportional to the number of edges, goes down by nearly a factor of 4 when moving 
from a fine to a coarse grid. The agglomeration algorithm has been implemented and 
the grids have been tested in a multigrid code. An area-weighted restriction is applied 
when moving from fine to coarse grids while a trivial injection is used for prolongation. 
Across a range of geometries and flows, it is shown that the agglomeration multigrid 
scheme compares very favorably with an unstructured multigrid algorithm that makes 
use of independent coarse meshes, both in terms of convergence and elapsed times. 


1 Introduction 

Multigrid techniques have been successfully used in computational aerodynamics for over a 
decade [1, 2]. The main advantage of the multigrid method when solving steady flows is the 
enhanced convergence while requiring little additional storage. In addition, multigrid can 
be used in conjunction with any convergent base scheme, with adequate care exercised in 
constructing proper restriction and prolongation operators between the grids. Perhaps the 
biggest advantage of multigrid is the fact that it deals directly with the nonlinear problem 
without requiring an elaborate linearization and the attendant storage required to store 
the matrix that arises from the linearization. Thus, multigrid techniques have enabled the 
practical solution of complex aerodynamic flows using millions of grid points. 

The initial efforts in multigrid were directed towards the solution of flows on structured 
grids where coarse grids can easily be derived from a given fine grid. Typically, this is done by 
omitting alternate grid lines in each dimension. These ideas have been extended to triangular
grids in two dimensions and to tetrahedral meshes in three dimensions [3, 4, 5, 6]. In previous
work by the second author, a sequence of unnested triangular grids of varying coarseness is 
constructed [3]. Piecewise linear interpolation operators are derived during a preprocessing 
step by using efficient search procedures. The residuals are restricted to coarse grids in a 
conservative manner. It has been shown that such a scheme can consistently obtain conver- 
gence rates comparable to those obtained with existing structured grid multigrid methods. 
For complex geometries, especially in three dimensions, however, constructing coarse grids 
that faithfully represent the complex geometries can become a difficult proposition. Thus, 
it is often desirable to derive the coarse grids directly from a given fine grid. 

The agglomeration multigrid strategy has been investigated by Lallemand et al. [7] and 
Smith [8]. Lallemand et al. use a base scheme where the variables are stored at the vertices 
of the triangular mesh, whereas Smith uses a scheme that stores the variables at the centers 
of triangles. In the present work, a vertex-based scheme is employed. Two dimensional 
triangular grids contain twice as many cells as vertices (neglecting boundary effects), and 
three dimensional tetrahedral meshes contain 5 to 6 times more cells than vertices. Thus, 
on a given grid, a vertex scheme incurs substantially less computational overhead than a 
cell-based scheme. Increased accuracy can be expected from a cell-based scheme, since this 
involves the solution of a larger number of unknowns. However, the increase in accuracy 
does not appear to justify the additional computational overheads, particularly in three 
dimensions. 
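
The two-dimensional ratio of cells to vertices quoted above follows from Euler's formula. As a brief sketch, let $V$, $E$, and $F$ denote the numbers of vertices, edges, and triangles of a triangulation of a simply connected planar domain (symbols introduced here for illustration), and neglect boundary effects so that every edge is shared by two triangles:

```latex
\begin{align*}
V - E + F &= 1, \qquad 3F \approx 2E, \\
V - \tfrac{3}{2}F + F &\approx 1
\quad\Longrightarrow\quad F \approx 2V - 2 \approx 2V .
\end{align*}
```

No comparably simple identity fixes the ratio for tetrahedral meshes, which is consistent with the range of 5 to 6 quoted in the text.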

The main idea behind the agglomeration strategy of Lallemand et al. [7] is to agglomerate 
the control volumes for the vertices using heuristics. The centroidal dual, composed of 
segments of the median of the triangulation, is a collection of the control volumes over 
which the Euler equations in integral form are solved. On simple geometries, Lallemand et 
al. were able to show that the agglomerated multigrid technique performed as well as the 
multigrid technique which makes use of unnested coarse grids. However, the convergence 
rates, especially for the second order accurate version of the scheme, appeared to degrade 
somewhat. Furthermore, the validation of such a strategy for more complicated geometries 
and much finer grids, as well as the incorporation of viscous terms for the Navier-Stokes 
equations, remains to be demonstrated. The work of Smith [8] constitutes the basis of a 
commercially available computational fluid dynamics code, and as such has been applied to a 
number of complex geometries [9] . However, consistently competitive multigrid convergence 
rates have yet to be demonstrated. 

In the present work, the agglomeration multigrid strategy is explored further. The issues 
involved in a proper agglomeration and the implications for the choice of the restriction 
and prolongation operators are addressed. Finally, flows over non-simple two-dimensional 
geometries are solved with the agglomeration multigrid strategy. This approach is compared 
with the unstructured multigrid algorithm of Mavriplis [3] which makes use of unnested 
coarse grids. Convergence rates as well as CPU times on a Cray Y-MP/1 are compared 
using both methods. 







2 Governing equations and discretization 


The Euler equations in integral form for a control volume $\Omega$ with boundary $\partial\Omega$ read

$$\frac{d}{dt}\int_{\Omega} u\,dv + \oint_{\partial\Omega} F(u,n)\,dS = 0. \qquad (1)$$


Here u is the solution vector comprised of the conservative variables: density, the two components
of momentum, and total energy. The vector F(u,n) represents the inviscid flux
vector for a surface with normal vector n. Equation (1) states that the time rate of change of
the variables inside the control volume is the negative of the net flux of the variables through 
the boundaries of the control volume. This net flux through the control volume boundary 
is termed the residual. In the present scheme the variables are stored at the vertices of 
a triangular mesh. The control volumes are non-overlapping polygons which surround the 
vertices of the mesh. They form the dual of the mesh, which is composed of segments of 
medians. Associated with each edge of the original mesh is a (segmented) dual edge. The 
contour integrals in Equation (1) are replaced by discrete path integrals over the edges of the 
control volume. Figure 1 shows a triangulation for a four-element airfoil and Figure 2 shows 
the centroidal dual. Each cell in Figure 2 represents a control volume. The path integrals 
are computed by using the trapezoidal rule. This can be shown to be equivalent to using a 
piecewise linear finite-element discretization. For dissipative terms, a blend of Laplacian and 
biharmonic operators is employed, the Laplacian term acting only in the vicinity of shocks. 
A multi-stage Runge-Kutta scheme is used to advance the solution in time. In addition, local 
time stepping, enthalpy damping and residual averaging are used to accelerate convergence. 
The principle behind the multigrid algorithm is that the errors associated with the high 
frequencies are annihilated by the carefully chosen smoother (the multi-stage Runge-Kutta 
scheme) while the errors associated with the low frequencies are annihilated on the coarser 
grids where these frequencies manifest themselves as high frequencies. In previous work [3], 
as well as in the present work, only the Laplacian dissipative term (with constant coefficient) 
is used on the coarse grids. Thus the fine grid solution itself is second order accurate, while 
the solver is only first order accurate on the coarse grids. 


3 Details of agglomeration 

Figure 1: Grid about a four-element airfoil.

The agglomeration (referred to also as coarsening) algorithm is a variation on the one used
by Lallemand et al. [7] and is given below:

1. Pick a starting vertex on the surface of one of the airfoils.

2. Agglomerate control volumes associated with its neighboring vertices which are not
already agglomerated.

3. Define a front as comprised of the exterior faces of the agglomerated control volumes.
Place the exposed edges in a queue.

4. Pick the new starting vertex as the unprocessed vertex incident to a new starting edge
which is chosen from the following choices given by order of priority:

• An edge on the front that is on the solid wall.

• An edge on the solid wall.

• An edge on the front that is on the far field boundary.

• An edge on the far field boundary.

• The first edge in the queue.

5. Go to Step 2 until the control volumes for all vertices have been agglomerated.
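
The five steps above can be sketched in Python as follows. This is only an illustration under simplifying assumptions, not the implementation used for the results in this paper: the front is kept as a queue of unprocessed vertices adjacent to agglomerated cells rather than as a queue of dual edges, and the names `agglomerate`, `neighbors`, `wall` and `far_field` are hypothetical. The input graph is assumed to be non-empty with symmetric adjacency.

```python
def agglomerate(neighbors, wall=frozenset(), far_field=frozenset(), start=None):
    """Greedy front-based agglomeration of vertex control volumes.

    neighbors : dict mapping each vertex to a list of adjacent vertices
    wall, far_field : sets of vertices lying on those boundaries
    Returns a dict mapping each fine vertex to a coarse-cell index.
    """
    coarse_of = {}   # fine vertex -> coarse cell index
    front = []       # unprocessed vertices bordering agglomerated cells, oldest first
    # Step 1: start on the wall if one is given, otherwise anywhere.
    seed = start if start is not None else next(iter(wall or neighbors))
    n_coarse = 0
    while seed is not None:
        # Step 2: merge the seed with its not-yet-agglomerated neighbours.
        members = [seed] + [w for w in neighbors[seed] if w not in coarse_of]
        for v in members:
            coarse_of[v] = n_coarse
        n_coarse += 1
        # Step 3: drop absorbed vertices from the front and queue the newly exposed ones.
        front = [v for v in front if v not in coarse_of]
        for v in members:
            for w in neighbors[v]:
                if w not in coarse_of and w not in front:
                    front.append(w)
        # Step 4: choose the next seed in the stated order of priority.
        free = [v for v in neighbors if v not in coarse_of]
        pools = ([v for v in front if v in wall],
                 [v for v in free if v in wall],
                 [v for v in front if v in far_field],
                 [v for v in free if v in far_field],
                 front,
                 free)
        seed = next((pool[0] for pool in pools if pool), None)
    # Step 5 is the loop itself: it ends when no unprocessed vertex remains.
    return coarse_of
```

Counting how many fine vertices map to each coarse index gives the kind of histogram used to judge the quality of an agglomeration. The list scans make this sketch quadratic in the number of vertices, which is acceptable for illustration but not for production grids.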

There are many other ways of choosing the starting vertex in Step 4 of the algorithm, but we 
have found the above strategy to be the best. The efficiency of the agglomeration technique 
can be characterized by a histogram of the number of fine grid cells comprising each coarse 
grid cell. Ideally, each coarse grid cell will be made up of exactly four fine grid cells. The 
various strategies can be characterized by how close they come to this ideal case. One 
variation is to pick the starting edge randomly from the edges currently on the front. Figure 
3 shows a plot of the number of coarse grid cells as a function of the number of fine grid cells 
comprising them, with our agglomeration algorithm described above, and with the variation. 
It is clear that our agglomeration algorithm is superior to the variant. The number of coarse 
grid cells having exactly one fine cell (singletons) is also much smaller with our algorithm 
compared to the variant. We have also investigated another variation where the starting 
vertex in Step 4 is randomly picked from the field and this turns out to be much worse. It 
is possible to identify the singleton cells and agglomerate them with the neighboring cells, 
but this has not been done. 

Figure 2: Centroidal dual for the triangulation of Figure 1.

The procedure outlined above is applied recursively to create coarser grids. Figure 4
shows an example of the agglomerated coarse grid. The boundaries between the control
volumes on the coarse grids are composed of the edges of the fine grid control volumes. We
have observed that the number of such edges only goes down by a factor of 2 when going from
a fine to a coarse grid. Since the computational load is proportional to the number of edges,
this is unacceptable in the context of multigrid. However, if we recognize that the multiple
edges separating two control volumes can be replaced by a single edge connecting the end 
points, then the number of edges does go down by a factor of 4. Since only a first order 
discretization is used on the coarse grids, there is no approximation involved in this step. 
If a flux function that involves the geometry in a nonlinear fashion were used, such as
Roe’s approximate Riemann solver, this would still be a very good approximation. It may also be
seen from Figure 4 that once this approximation is made, the degree of a node in this graph
is still 3, i.e., each node in the interior has precisely three edges emanating from it. Thus
the agglomerated grid implies a triangulation of the vertices of a dual graph of the coarse
grid. Trying to reconstruct the triangulation is not a good idea, since this may result in a
graph with intersecting edges (a non-planar graph), which leads to invalid triangulations.
If a valid triangulation could always be constructed, it would be possible to use the coarse 
grid triangulation for constructing piecewise linear operators for prolongation and restriction 
akin to the non-nested multiple grid scheme [3] . In practice, we have often found the implied 
coarse grid triangulations to be invalid and therefore the coarse grids are only defined in 
terms of control volumes. This has some important implications for the multigrid algorithm 
discussed below. 
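
The edge-merging step can be illustrated with a short Python sketch. It assumes, hypothetically, that every fine-grid edge (i, j) carries the normal of its dual edge, scaled by the dual-edge length and pointing from vertex i to vertex j. In two dimensions the sum of these scaled normals along a chain of segments equals the scaled normal of the straight segment joining the end points of the chain, so summing them is the same as replacing the chain by a single edge. A dictionary keyed on the coarse-cell pair plays the role of the hash table; the function name `merge_coarse_edges` is an assumption.

```python
def merge_coarse_edges(fine_edges, coarse_of):
    """Combine all fine dual edges separating two coarse cells into one edge.

    fine_edges : dict {(i, j): (nx, ny)}, normal scaled by dual-edge length, pointing i -> j
    coarse_of  : mapping from fine vertex to coarse-cell index
    Returns dict {(I, J): (nx, ny)} with I < J and the normal pointing I -> J.
    """
    coarse_edges = {}
    for (i, j), (nx, ny) in fine_edges.items():
        I, J = coarse_of[i], coarse_of[j]
        if I == J:
            continue  # this edge lies inside a coarse cell and disappears
        if I > J:
            # Store each pair once, flipping the normal to keep a consistent orientation.
            I, J, nx, ny = J, I, -nx, -ny
        sx, sy = coarse_edges.get((I, J), (0.0, 0.0))
        coarse_edges[(I, J)] = (sx + nx, sy + ny)
    return coarse_edges
```

Because the lookup is by hashed key, the cost is linear in the number of fine-grid edges.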

Figure 3: No. of coarse grid cells as a function of the fine grid cells they contain.

Since the fine grid control volumes comprising a coarse grid control volume are known,
the restriction is similar to that used for structured grids. The residuals are simply summed
from the fine grid cells and the variables are interpolated in an area-weighted manner. For the
prolongation operator, we use a simple injection (a piecewise constant interpolation). This
is an unfortunate but unavoidable consequence of using the agglomeration strategy. A piecewise
linear prolongation operator implies a triangulation, the avoidance of which is the main
motivation for the agglomeration. However, additional smoothing steps may be employed
to minimize the adverse impact of the injection. This is achieved by applying an averaging
procedure to the injected corrections. In an explicit scheme, solution updates are directly
proportional to the computed residuals. Thus, by analogy, for the multigrid scheme, corrections
may be smoothed by a procedure previously developed for implicit residual smoothing
[3]. The implicit equations for the smoothed corrections are solved using two iterations of a
Jacobi scheme after the prolongation at each grid level.
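
A minimal NumPy sketch of these transfer operators is given below for a single scalar variable per vertex. The function names, the Laplacian form of the implicit smoothing equation, $(1 + \epsilon\,d_i)\,s_i - \epsilon \sum_j s_j = \delta u_i$ with $d_i$ the number of neighbours of vertex $i$, and the value $\epsilon = 0.5$ are assumptions made for illustration; the text only specifies summed residuals, area-weighted variables, injection, and two Jacobi iterations.

```python
import numpy as np

def restrict(u_f, r_f, area_f, coarse_of, n_coarse):
    """Area-weighted restriction of variables and summation of residuals."""
    area_c = np.bincount(coarse_of, weights=area_f, minlength=n_coarse)
    u_c = np.bincount(coarse_of, weights=area_f * u_f, minlength=n_coarse) / area_c
    r_c = np.bincount(coarse_of, weights=r_f, minlength=n_coarse)
    return u_c, r_c

def prolong(du_c, coarse_of):
    """Injection: every fine cell receives the correction of its coarse parent."""
    return du_c[coarse_of]

def smooth_corrections(du, edges, eps=0.5, sweeps=2):
    """Jacobi sweeps on (1 + eps*deg_i) s_i - eps * sum_j s_j = du_i (assumed form)."""
    du = np.asarray(du, dtype=float)
    edges = np.asarray(edges)
    i, j = edges[:, 0], edges[:, 1]
    deg = np.bincount(np.concatenate([i, j]), minlength=du.size)
    s = du.copy()
    for _ in range(sweeps):
        nbr_sum = np.zeros_like(s)
        np.add.at(nbr_sum, i, s[j])  # accumulate neighbour values edge by edge
        np.add.at(nbr_sum, j, s[i])
        s = (du + eps * nbr_sum) / (1.0 + eps * deg)
    return s
```

A constant correction field passes through the smoothing unchanged, while an isolated spike produced by the injection is spread over its neighbours.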

The agglomeration step is done as a preprocessing operation on a workstation. It is 
very efficient and employs hashing to combine the multiple fine grid control volume edges 
separating two coarse grid cells into one edge. The time taken to derive 5 coarse grids on a 
Silicon Graphics work station model 4D/25 (20 MHz clock) for the grid shown in Figure 1 
with 11340 vertices is 83 seconds. 


4 Results and discussion 

Results are presented for two inviscid flow calculations and the performance of the agglom- 
erated multigrid algorithm is compared with that of the non-nested multiple grid multigrid 
algorithm of [3]. The first flow considered is flow over an NACA0012 airfoil at a freestream 
Mach number of 0.8 and angle of attack of 1.25°. The dual to the fine grid having 4224 
vertices is shown in Figure 5. The sequence of unnested grids (not shown) for use with 
the non-nested multigrid algorithm contains 1088, 288 and 80 vertices, respectively. The 
agglomerated grids are shown in Figure 6. These grids have 1088, 288 and 80 vertices (re- 
gions) as well. Figure 7 shows the convergence histories obtained with the non-nested and 
agglomeration multigrid algorithms. Both the multigrid strategies employ W-cycles. The
convergence histories show that the non-nested multigrid algorithm slightly outperforms the
agglomeration algorithm. The CPU times required for 100 iterations on the Cray Y-MP/1 are 25 and
24 seconds, respectively. Thus the two schemes perform about equally well.

Figure 4: An example of an agglomerated coarse grid.

The next case considered is flow over a four-element airfoil. The freestream Mach number
is 0.2 and the angle of attack is 5°. The fine grid has 11340 vertices and is shown in Figure 1.
The coarse grids for use with the non-nested multigrid algorithm (not shown) contain 2942
and 727 vertices. The two agglomerated grids are shown in Figure 8. These grids contain
3027 and 822 vertices (regions), respectively. The convergence histories of the non-nested
and agglomeration multigrid algorithms are shown in Figure 9. The convergence histories 
are comparable but the convergence is slightly better with the agglomerated multigrid strat- 
egy. This is a bit surprising since the original multigrid algorithm employs a piecewise linear 
prolongation operator. A possible explanation is that the agglomeration algorithm creates 
better coarse grids than those employed in the non-nested algorithm. The CPU times re- 
quired on the Cray Y-MP are 59 and 58 seconds with the original and the agglomerated 
multigrid, respectively, using three grids. 

Perhaps the biggest advantage of the agglomeration algorithm lies in its ability to generate 
very coarse grids without any user intervention. Such extremely coarse grids should be 
beneficial in multigrid. Figure 10 shows two coarser grids for the four element airfoil case. 
These grids contain 63 and 22 vertices, respectively. With these grids it is now possible 
to use a 6 level agglomeration multigrid strategy. However, because these coarse grids are 
rather nonuniform, it is imperative that the first order coarse grid operator be a strictly 
positive scheme (i.e. one can no longer rely on assumptions of grid smoothness as conditions 
for stability). With the original first order operator in place, which is composed of a central 
difference plus a dissipative flux, it is difficult to guarantee the positivity of the scheme for 
arbitrary grids. In fact, the scheme has been found to be unstable on some of the very coarse 
and distorted agglomerated meshes. However, if the flux is replaced by a truly first order 
upwind flux, given for example by Roe’s flux difference splitting [10], a stable scheme can be 
recovered for these coarse agglomerated grids. Thus, for each of the coarse grids obtained 
by agglomeration, a check of the convergence properties of the coarse grid operator at the 
desired flow conditions is carried out if problems are experienced with the multigrid. This 
step ensures that the coarse grid operators are convergent and that the problems with the 
multigrid, if any, come from the inter-grid communication. Figure 11 shows the convergence 
history with the 6 grid level agglomerated multigrid scheme. Also shown is the convergence 
with the 3 grid agglomeration multigrid scheme. In this particular case, Roe’s upwind flux is
used on the two coarsest grids, where central differencing proved unreliable. The time taken
for the 6 grid agglomeration multigrid is 86 seconds. Thus the improved convergence rate is
not entirely reflected in terms of the required computational resources. This is attributed to
the increased time required by Roe’s upwind scheme, which involves a substantial number
of floating point operations. This case serves to demonstrate the importance of the stability
of each of the individual coarse grid operators in the multigrid scheme. Although first order
upwinding has been employed on the distorted coarse meshes for demonstration purposes, it
should be possible to construct stable central difference operators on such meshes.

Figure 5: Dual to the fine grid having 4420 vertices.

5 Conclusions 

It has been shown that the agglomeration multigrid strategy can be made to approximate 
the efficiency of the unstructured multigrid algorithm using independent, non-nested coarse 
meshes, in terms of both convergence rates and CPU times. It is further shown that arbi- 
trarily coarse grids can be obtained with the agglomeration technique, although care must 
be taken to ensure that the coarse grid operator is convergent on these grids. Agglomeration 
has direct applications to three dimensions, where it may be difficult to derive coarse grids 
that conform to the geometry. In future work, alternate methods of generating coarse grids 
will be investigated. These may include the creation of maximal independent sets to create 
the coarse grid seed points and using these seed points to agglomerate the fine grid cells 
around them. A maximal independent set is a set of vertices, no two of which are adjacent in the
original graph, that cannot be enlarged without violating this property. Since coarsening algorithms can be viewed as
partitioning strategies, there also exists a possible interplay between agglomerated multigrid 
techniques and distributed memory parallel implementations of the algorithm, which should 
be further investigated. Finally, the implementation of the viscous terms for Navier-Stokes 
flows on arbitrary polygonal control volumes must be carried out for this type of strategy to 
be applicable to viscous flows. 
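
The maximal independent set idea mentioned above as future work can be sketched as follows. This is a hypothetical illustration rather than an algorithm taken from this paper: a greedy pass selects seed vertices, and each remaining vertex is then attached to an adjacent seed, which always exists because the set is maximal. Symmetric adjacency lists are assumed, and both function names are invented for the example.

```python
def maximal_independent_set(neighbors):
    """Greedy maximal independent set: no two selected vertices are adjacent."""
    selected, blocked = [], set()
    for v in neighbors:
        if v not in blocked:
            selected.append(v)
            blocked.add(v)
            blocked.update(neighbors[v])  # neighbours of a seed can no longer be seeds
    return selected

def agglomerate_around_seeds(neighbors, seeds):
    """Attach every non-seed vertex to the first adjacent seed."""
    seed_index = {s: k for k, s in enumerate(seeds)}
    coarse_of = dict(seed_index)
    for v in neighbors:
        if v not in seed_index:
            coarse_of[v] = next(seed_index[w] for w in neighbors[v] if w in seed_index)
    return coarse_of
```

The choice of "first adjacent seed" is arbitrary; a practical version would presumably balance the sizes of the resulting coarse cells.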
Figure 6: Three agglomerated coarse grids for the NACA0012 test case.

Figure 8: Two agglomerated coarse grids for the four-element test case.

Figure 9: Convergence histories with the agglomerated and original multigrid.

Figure 11: Convergence histories with the 6-level and 3-level agglomerated multigrid algorithms.
SUMMARY


The multigrid properties of two data reconstruction methods used for achieving second-order 
spatial accuracy when solving the two-dimensional Euler equations are examined. The data recon- 
struction methods are used with an implicit upwind algorithm which uses linearized backward-Euler 
time-differencing. The solution of the resulting linear system is performed by an iterative procedure. 
In the present study only regular quadrilateral grids are considered, so a red-black Gauss-Seidel itera- 
tion is used. Although the Jacobian is approximated by first-order upwind extrapolation, two alterna- 
tive data reconstruction techniques for the flux integral that yield higher-order spatial accuracy at 
steady state are examined. The first method, probably most popular for structured quadrilateral grids, 
is based on estimating the cell gradients using one-dimensional reconstruction along curvilinear coor- 
dinates. The second method is based on Green’s theorem. Analysis and numerical results for the two- 
dimensional Euler equations show that data reconstruction based on Green’s theorem has superior 
multigrid properties as compared to the one-dimensional data reconstruction method. 


INTRODUCTION 


Multigrid methods have become a popular tool for obtaining steady solutions of the Euler or 
Navier-Stokes equations. Although true multigrid performance is difficult to obtain, there is no doubt 
that multigrid methods can significantly decrease the computer time necessary for convergence. 
However, the gain in performance from a single grid algorithm is directly related to the type of 
smoothing operator used on each level. Although explicit methods may be simple to program and 
have a relatively small number of operation counts, the unconditional stability that implicit methods 
offer tends to greatly overcome their disadvantages. In addition, explicit time advancement methods 
generally do not exhibit good smoothing properties when used with higher-order upwind data recon- 
struction techniques for a system of equations. 

In addition to the time advancement technique, the method of flux evaluation plays an important 
role in algorithm efficiency. One commonly used way to achieve higher order accuracy is to recon- 
struct the data on cell faces appropriately using the cell centered data. For grids which consist of log- 
ically rectangular cells, the most popular approach is to use simple one-dimensional curve fitting 
methods such as used by Anderson et al. [1]. The one-dimensional data reconstruction methods have 
been used with great success in two and three-dimensional CFD codes which use grids consisting of 
logically rectangular cells. 


General fluid dynamics problems may require generating grids around complex shapes for which
it is difficult to generate a single grid consisting of logically rectangular cells. Using multiple-block
grids to model complex geometries has been implemented with success using multigrid algorithms
[2] [3]. Another approach for generating grids around complex geometries is to use triangular elements.
On unstructured triangular grids, however, data reconstruction methods based on Green’s theorem
are more prevalent since this does not require interpolation along a coordinate direction.

In reference [4] the authors presented a single grid stability analysis and numerical experiments 
of several different data reconstruction methods. In this paper, we extend this work to show the effect 
of the data reconstruction on multigrid performance. The Full-Approximation Scheme (FAS) multi- 
grid method has been incorporated into a quadrilateral-based unstructured grid Euler solver using the 
implicit time marching method of reference [5]. 


GOVERNING EQUATIONS 


The governing equations are the time-dependent Euler equations, which express the conservation
of mass, momentum, and energy for an inviscid gas. The equations are given by

$$\frac{\partial Q}{\partial t} + \frac{1}{A}\oint_{\Omega} \mathbf{F}\cdot\hat{n}\,dl = 0 \qquad (1)$$

where $A$ is the area of the cell that is bounded by the contour $\Omega$ with the outward-pointing unit normal
$\hat{n}$. The state vector $Q$ and the flux vectors $\mathbf{F}$ are given as

$$Q = \begin{bmatrix} \rho \\ \rho u \\ \rho v \\ e \end{bmatrix}, \qquad
\mathbf{F}\cdot\hat{n} = \begin{bmatrix} \rho U \\ \rho U u + p\,\hat{n}_x \\ \rho U v + p\,\hat{n}_y \\ (e + p)\,U \end{bmatrix} \qquad (2)$$

where $\rho$ is the density, $u$ and $v$ are the $x$ and $y$ components of the velocity, $e$ is the energy per unit volume,
$p$ is the pressure, and $U$ is the velocity in the direction of the outward pointing normal to the cell

$$U = \hat{n}_x u + \hat{n}_y v \qquad (3)$$

The equations are closed with the equation of state for a perfect gas

$$p = (\gamma - 1)\left[e - \rho\,(u^2 + v^2)/2\right] \qquad (4)$$

where $\gamma$ is the ratio of specific heats.
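
As a small illustration of equations (2) to (4), the following Python sketch evaluates the pressure and the normal flux vector from a conservative state. The function names and the default value $\gamma = 1.4$ (air) are assumptions introduced for this example.

```python
def pressure(rho, u, v, e, gamma=1.4):
    """Perfect-gas equation of state, equation (4)."""
    return (gamma - 1.0) * (e - 0.5 * rho * (u * u + v * v))

def normal_flux(Q, nx, ny, gamma=1.4):
    """Inviscid flux through a face with unit normal (nx, ny), equation (2)."""
    rho, rho_u, rho_v, e = Q
    u, v = rho_u / rho, rho_v / rho
    p = pressure(rho, u, v, e, gamma)
    U = nx * u + ny * v  # velocity normal to the face, equation (3)
    return (rho * U, rho_u * U + p * nx, rho_v * U + p * ny, (e + p) * U)
```

For a gas at rest only the pressure terms survive, so the flux reduces to the pressure acting along the face normal in the two momentum equations.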


TIME ADVANCEMENT ALGORITHM 


The method used for accelerating the solution to steady state is the Full Approximation Scheme
(FAS) multigrid method. The technique used for smoothing the errors on each grid level is based on
the scheme described in reference [5] applied to a grid of quadrilateral cells. The method is an
implicit upwind algorithm that uses linearized backward-Euler time differencing. The cell-averaged
solution vector $Q$ is updated at each time level $n$ with the equations

$$L^n \Delta Q^n = -R(Q^n) \qquad (5)$$

$$Q^{n+1} = Q^n + \Delta Q^n \qquad (6)$$

The operator $R(Q^n)$ is the discrete approximation to the flux integral in equation (1) at time level $n$.
The fluxes are evaluated with Van Leer flux-vector splitting [6] and are second-order accurate if a linear
data reconstruction method is used. The operator $L^n$ is written as

$$L^n = \frac{I}{\Delta t} + \frac{\partial R^n}{\partial Q^n} \qquad (7)$$

To minimize the bandwidth and maintain block-diagonal dominance of the matrix $L^n$, the Jacobian
$\partial R^n/\partial Q^n$ is approximated by first-order upwind differencing rather than by exactly linearizing the
second-order right-hand side of equation (5). The steady-state solution remains second-order accurate.
The solution of the linear system (5) is performed by an iterative procedure. In the present study
subiterations are performed using red-black Gauss-Seidel where the flux-Jacobians are
frozen at the current time level. It is recognized that the linear system must be solved adequately to
gain the full benefits of an implicit formulation. However, the scope of this study is limited to the
effects of various data reconstructions used to compute the right-hand side of equation (5). The stability
and smoothing analysis presented later assumes the linear system is solved exactly at each time step.


UPWIND STENCILS 


All of the reconstruction stencils used for the right-hand side of equation (5) in this study are
based on MUSCL-type differencing [6]. In this approach, the flux vector $\mathbf{F}$ is split into two components

$$\mathbf{F}\cdot\hat{n} = F(Q^+, Q^-) = F^-(Q^+) + F^+(Q^-) \qquad (8)$$

where

$$Q^{\pm}_{face} = Q_{cell} + \Theta^{\pm}(Q) \qquad (9)$$

The values of $Q$ are determined on each side of a cell face by using an interpolation operator $\Theta$, and
reconstructing the cell-centered data on each face as shown in figure 1. Upwind fluxes are computed
from these face values with Van Leer flux-vector splitting [6]. The stencils that are considered differ
in the interpolation operator $\Theta$.

One of the most common methods of data reconstruction for upwind structured flow solvers is to
interpolate the data to the cell face using only the cells along the curvilinear coordinate direction
which is perpendicular to the face [1]. Using the cell numbering shown in figure 2, a family of
schemes is given by

$$Q^-_{face} = Q_2 + \tfrac{1}{4}\left[(1 - \kappa)\Delta_- + (1 + \kappa)\Delta_+\right] Q_2 \qquad (10)$$

$$Q^+_{face} = Q_3 - \tfrac{1}{4}\left[(1 + \kappa)\Delta_- + (1 - \kappa)\Delta_+\right] Q_3 \qquad (11)$$

where

$$\Delta_+ Q_i = Q_{i+1} - Q_i \qquad (12)$$

$$\Delta_- Q_i = Q_i - Q_{i-1} \qquad (13)$$

These formulas assume the grid has been transformed from physical $(x, y)$ space to computational
$(\xi, \eta)$ space where the grid spacing $(\delta\xi, \delta\eta)$ is unity. Using this family of schemes as the interpolation
operator results in the flux integration in a cell depending on a total of 9 cells for $-1 \le \kappa \le 1$
as shown in figure 3.
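
The family of schemes in equations (10) to (13) can be written as a short Python function. The sketch assumes, since figure 2 is not reproduced here, that cells 1 to 4 lie consecutively along one coordinate line and that the face of interest separates cells 2 and 3; the function name `face_states` is hypothetical.

```python
def face_states(q1, q2, q3, q4, kappa):
    """States on the two sides of the face between cells 2 and 3 (unit grid spacing).

    Returns (q_minus, q_plus): the value extrapolated from cell 2, equation (10),
    and the value extrapolated from cell 3, equation (11).
    """
    q_minus = q2 + 0.25 * ((1.0 - kappa) * (q2 - q1) + (1.0 + kappa) * (q3 - q2))
    q_plus = q3 - 0.25 * ((1.0 + kappa) * (q3 - q2) + (1.0 - kappa) * (q4 - q3))
    return q_minus, q_plus
```

For data that vary linearly along the coordinate line, both states equal the exact value at the face for any $\kappa$, which is what makes the reconstruction second-order accurate. With $\kappa = -1$ each state uses only the two cells on its own side of the face.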



Figure 1. Data reconstruction for upwind fluxes

We can examine the relation between the discrete equations (10) - (13) and equation (8) by
expanding the terms of the equations in a Taylor’s series. Examining the interpolation of $Q$ along a $\xi$
coordinate line, a Taylor’s series expansion about cell 2 is written as

$$Q^-_{face} = Q_2 + \frac{\delta\xi}{2}\left.\frac{\partial Q}{\partial \xi}\right|_{\xi=\xi_2} + \frac{\delta\xi^2}{8}\left.\frac{\partial^2 Q}{\partial \xi^2}\right|_{\xi=\xi_2} + \cdots \qquad (14)$$

If $\kappa = 0$, a central difference across cell 2 is used to calculate the gradient so that

$$\Theta^-(Q) = \frac{\delta\xi}{2}\left.\frac{\partial Q}{\partial \xi}\right|_{\xi=\xi_2} \approx \frac{1}{2}\left(\frac{Q_3 - Q_1}{2}\right) \qquad (15)$$

For $\kappa = -1$, the gradient is approximated using only one-sided information

$$\Theta^-(Q) = \frac{1}{2}\left(Q_2 - Q_1\right) \qquad (16)$$

Although not considered in this study, if $\kappa = 1/3$, the first and second derivatives of equation (14) are
estimated with central differences which yield a spatially third-order accurate steady-state solution in
one dimension.


The other stencil used in constructing the data on the face is based on Green’s theorem. This was
used for triangular grids by Barth and Jespersen [7] and Frink [8]. This method of data reconstruction
was also used by Anderson [5] on triangular grids in conjunction with the implicit scheme shown here.
The interpolation operator is evaluated in physical $(x, y)$ space and is written as

$$\Theta^{\pm}(Q) = (\nabla Q \cdot \mathbf{r})^{\pm} \qquad (17)$$

where $\nabla Q$ is the average gradient in the cell and is evaluated using Green’s theorem.

$$\nabla Q = \begin{bmatrix} \partial Q/\partial x \\ \partial Q/\partial y \end{bmatrix} = \frac{1}{A}\oint_{\Omega} Q\,\hat{n}\,dl \qquad (18)$$

To evaluate this numerically, inverse-distance weighting is used to transfer the cell-averaged data to
the nodes [8].

$$Q_{node} = \frac{\sum_i Q_i / r_i}{\sum_i 1 / r_i} \qquad (19)$$

where $r_i$ is the distance from the $i$-th cell center to the node. This reduces to simple averaging for uniform
grids. Next, the trapezoidal rule is used to integrate around the cell. The $x$-component is given
by

$$\frac{\partial Q}{\partial x} = \frac{1}{A}\sum_{i}\left(\frac{Q_{node_1} + Q_{node_2}}{2}\right)\hat{n}_x\,\Delta s_i \qquad (20)$$

Here, $A$ is the area of the cell, $node_1$ and $node_2$ define $face_i$, $\Delta s_i$ is the length of $face_i$, and $\hat{n}_x$ is the
$x$-component of the outward pointing unit normal. The data on the cell faces is then determined using
(17) where the position vector, $\mathbf{r}$, is computed from the cell center to the face center. Using Green’s
theorem and the trapezoidal rule results in a stencil of 21 cells for the flux integration. The complete
procedure for determining $Q$ values on the cell faces is shown in figure 4.
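
The Green's theorem reconstruction can be illustrated with the NumPy sketch below, which follows the two steps described above: inverse-distance weighting to the nodes, then a trapezoidal-rule contour sum around the cell. The function names are hypothetical, and the cell nodes are assumed to be listed counter-clockwise so that the outward normal of an edge with direction $(dx, dy)$, scaled by its length, is $(dy, -dx)$.

```python
import numpy as np

def node_value(cell_values, cell_centers, node):
    """Inverse-distance weighted average of the surrounding cell values at a node."""
    r = np.linalg.norm(np.asarray(cell_centers, float) - np.asarray(node, float), axis=1)
    w = 1.0 / r
    return float(np.dot(w, cell_values) / w.sum())

def green_gauss_gradient(nodes, q_nodes):
    """Average gradient in a polygonal cell from nodal values (trapezoidal rule).

    nodes   : (N, 2) array of node coordinates, counter-clockwise
    q_nodes : (N,) array of nodal values
    """
    nodes = np.asarray(nodes, float)
    q = np.asarray(q_nodes, float)
    nxt = np.roll(np.arange(len(nodes)), -1)           # index of the next node on each face
    dx = nodes[nxt, 0] - nodes[:, 0]
    dy = nodes[nxt, 1] - nodes[:, 1]
    area = 0.5 * np.sum(nodes[:, 0] * nodes[nxt, 1] - nodes[nxt, 0] * nodes[:, 1])
    q_face = 0.5 * (q + q[nxt])                        # trapezoidal rule on each face
    return np.array([np.sum(q_face * dy), -np.sum(q_face * dx)]) / area

def face_value(q_cell, grad, cell_center, face_center):
    """Reconstructed value at a face: cell value plus gradient dotted with r."""
    r = np.asarray(face_center, float) - np.asarray(cell_center, float)
    return q_cell + float(np.dot(grad, r))
```

For a linear field the contour sum returns the exact gradient on any convex quadrilateral, because the trapezoidal rule is exact along each straight face.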
Figure 4. Data Reconstruction Using Green’s Theorem


TRUNCATION ERROR 


A truncation error analysis for the 9-point stencils using $\kappa = 0$ and $\kappa = -1$ as well as the 21-point
stencil has been shown in reference [4] and is summarized here for completeness. The truncation
error of each of the three stencils is examined by considering the semi-discrete approximation to a
scalar advection equation with non-negative coefficients $a$ and $b$.

$$\frac{\partial u}{\partial t} + a\frac{\partial u}{\partial x} + b\frac{\partial u}{\partial y} = 0 \qquad (21)$$

This linear equation is a simplified model of the two-dimensional Euler equations. 

Leaving the equation continuous in time, the spatial derivatives are approximated by each stencil
and expanded in a Taylor series about the point being updated. The 9-point stencil with $\kappa = -1$ leads
to the following equation:

(22)

where $\delta x$ and $\delta y$ are the grid spacing in the x and y directions, respectively. The difference approximation
is second order in the grid spacing with a dispersive leading truncation error term. The
approximation is also dissipative, as can be seen from the fourth-derivative term of the truncation
error. For an advection velocity that is aligned with the grid ($a$ or $b = 0$), the dissipative term reduces
to a fourth derivative in the flow direction.

For the 9-point stencil with $\kappa = 0$ we get the following equation:

(23)

This equation differs from equation (22) in the magnitude of the coefficients of the dispersive and dissipative
terms. We expect this difference formula to be less dissipative than the fully-upwind stencil.


A Taylor series expansion of the 21-point node-averaged stencil for the scalar advection equation
gives the following:

(24)

This equation looks remarkably similar to equation (23), as the coefficients of the dispersive and dissipative
terms are identical. However, the dissipative term of the 21-point stencil contains cross derivatives
and looks similar to a biharmonic term. Note that even for a grid-aligned advection velocity the
cross-derivative term does not vanish. We expect that this difference stencil, although of the same formal
accuracy as the 9-point stencil, will be more dissipative.


STABILITY ANALYSIS 


The basic stability properties of the upwind stencils considered here were examined in reference
[4]. A Von Neumann analysis is used to examine the stability and convergence properties of the 9-point
$\kappa = 0$ and $\kappa = -1$ stencils and the 21-point stencil. For each of the stencils, the equations are discretized
according to equations (5) to (7). The operator $L^n$ is obtained by first-order interpolation in
all cases, and the right-hand side $R(Q^n)$ is obtained with the three second-order stencils.

Although the Von Neumann analysis is commonly applied to the scalar advection equation, we
examine the stability of the system obtained by linearizing the Euler equations about a constant state,
similar to the work in reference [5]. Applying a Fourier transform in space to the solution vector $Q^n$
gives the equation

$$Q^n = z^n Q_0 \exp(i\phi)\exp(i\psi) \qquad (25)$$

where $\phi = n x/\delta x$, $\psi = n y/\delta y$ are the Fourier modes in the x and y directions, respectively, and $z$ is
the amplification factor. Substitution of this expression into (5) yields the following equation

$$L\{(z - 1)Q_0\} = -R\{Q_0\} \qquad (26)$$

where $L$ and $R$ are the Fourier symbols of the left- and right-hand-side operators for the constant-coefficient
problem. Equations (25) and (26) lead to a generalized eigenvalue problem for $z$. By rearranging
terms, we define the amplification matrix

$$G = I - L^{-1}R \qquad (27)$$

and $z$ is an eigenvalue of $G$. The amplification matrix is 4 x 4 and complex; a necessary condition for
stability is that the magnitudes of the eigenvalues of $G$ are less than one for all $\phi$ and $\psi$. We will refer
to the amplification factor for a given mode as the magnitude of the largest eigenvalue for that mode.
The matrix $G$ depends upon four parameters: the Mach number; the flow direction; the CFL number,
defined here as $c\,\delta t/\delta x$, where $c$ is the speed of sound; and the cell aspect ratio, $\delta y/\delta x$.

The eigenvalue problem was solved numerically for a series of Fourier modes $\phi$ and $\psi$ in the
range $[-\pi, \pi]$. Below we show the amplification factors for a Mach number of 0.8, flow aligned with
the grid in the x-direction, a CFL number of 100, and a cell aspect-ratio of 1. These results are typical
of the stability properties of the implicit scheme at other Mach numbers.

Shown in figure 5 are the amplification factors for the 9-point stencil with $\kappa = -1$ and $\kappa = 0$
for a CFL of 100. This CFL number represents the asymptotic behavior for the three stencils considered
here as shown in reference [4]. Note that the fully-upwind scheme ($\kappa = -1$) has very poor
damping of the short-wavelength modes. As CFL $\to \infty$, the amplification factor of the $\phi = \pm\pi$ mode
asymptotically approaches 1. Although unconditionally stable, the scheme is a very poor smoother
for an FAS multigrid scheme using high CFL numbers. On the other hand, the upwind-biased stencil
($\kappa = 0$) leads to a scheme with excellent smoothing properties. All the Fourier modes are very well
damped; in particular, the checkerboard and sawtooth modes have an amplification factor that tends
to 0 with increasing CFL numbers. This scheme appears to be a very good multigrid smoother.
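
The contrast between the two values of $\kappa$ can be reproduced on a much simpler model than the 4 x 4 Euler system analyzed here. The sketch below is a scalar, one-dimensional analogue and not the analysis of this paper: it assumes linear advection with positive speed $a$ on a unit grid, a first-order upwind left-hand side, the reconstruction of equations (10) to (13) on the right-hand side, and a CFL number defined as $a\,\Delta t/\delta x$. The function name is hypothetical.

```python
import numpy as np

def amplification(phi, kappa, cfl):
    """Amplification factor z = 1 - R/L for 1-D scalar advection (model problem)."""
    back = 1.0 - np.exp(-1j * phi)   # Fourier symbol of the backward difference
    fwd = np.exp(1j * phi) - 1.0     # Fourier symbol of the forward difference
    # Left-hand side: I/dt plus the first-order upwind Jacobian, scaled by dx/a.
    L = 1.0 / cfl + back
    # Right-hand side: difference of the reconstructed upwind face values.
    R = back * (1.0 + 0.25 * ((1.0 - kappa) * back + (1.0 + kappa) * fwd))
    return 1.0 - R / L

for kappa in (-1.0, 0.0):
    print(kappa, abs(amplification(np.pi, kappa, 1.0e6)))
```

In this model the sawtooth mode $\phi = \pi$ has $z \to \kappa$ as the CFL number grows, so its magnitude tends to 1 for $\kappa = -1$ and to 0 for $\kappa = 0$, which is consistent with the behavior described above for the full system.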

By using the 21-point stencil to discretize the steady-state operator we get even better stability
properties, as is seen in figure 6. All the high-frequency modes are damped extremely well; the
amplification factor for φ, ψ = ±π has an asymptote of 0, making this operator an excellent choice as a
multigrid smoother.

Considering the 9-point, κ = 0 stencil and the 21-point stencil in the case where the flow is skew
to the grid, we get the results shown in figure 7. In both cases the damping of the short wavelengths is
essentially unchanged. The damping of the long wavelengths is worse, however, and the deterioration
is somewhat more noticeable for the 9-point stencil, particularly for the intermediate wavelengths.
The 21-point stencil retains its excellent stability properties over a larger range of wavelengths.



Figure 5. Amplification factors for 9-point stencil, Mach = 0.8, α = 0, CFL = 100: κ = -1 (left) and κ = 0 (right)

Figure 6. Amplification factor for 21-point stencil, Mach = 0.8, α = 0, CFL = 100

Shown in figure 8 are the smoothing factors, defined as the maximum of the amplification factor
over the range π/2 ≤ |φ|, |ψ| ≤ π, and average amplification factors for the 9-point stencil over a
range of CFL numbers from 1 to 1024 and κ from -1 to 1. The Mach number and flow angle are 0.8
and 45 degrees, respectively. These plots clearly show that the κ = 0 stencil has the best smoothing
properties for the 9-point stencil.

A comparison of the smoothing and amplification factors for the 21-point and the 9-point, κ = 0
stencils is shown in figures 9 and 10. Shown in figure 9 are the smoothing and average amplification
factors for flow aligned with the grid. Note that for CFL numbers up to about 16, the smoothing
factors are identical. The asymptotic smoothing factors are slightly different: 0.524 and 0.563 for the
21-point and 9-point stencils, respectively. In contrast to the smoothing factors, the average amplification
factor is about 50% lower for the 21-point stencil compared to the 9-point stencil. In figure 10 plots of
Figure 7. Flow at 45 degrees to the grid: 9-point stencil, Mach = 0.8, α = 45 degrees, CFL = 100: κ = 0 (left) and 21-point stencil (right)

Figure 8. Smoothing factors and average amplification factors for κ methods


the smoothing factors and average amplification factors are shown for both stencils for flow at 45 degrees
to the grid. The average amplification factors are virtually unchanged, but there is some difference in the
smoothing factors. The asymptotic values of the smoothing factors have deteriorated, increasing to
0.554 and 0.628 for the 21-point and 9-point stencils, respectively. The 21-point stencil's smoothing
factor is less sensitive to the flow angle than that of the 9-point, κ = 0 stencil.


The effect of grid aspect ratio on the 21-point and 9-point κ = 0 stencils is shown in figure 11.
Note that there is a large degradation in the smoothing properties for the 9-point κ = 0 stencil when
using high aspect ratio cells such as those in a viscous calculation near a solid wall or wake region.
The 21-point stencil, however, is generally not affected by the cell aspect ratio. Because its smoothing
factor is insensitive to changes in flow angle and grid aspect ratio, we expect the 21-point stencil to
give more uniform multigrid performance than the 9-point, κ = 0 stencil over a variety of flow
conditions and grid topologies.






Figure 9. Smoothing factors and average amplification factors for the 21-point stencil
and the 9-point stencil, κ = 0, for a flow angle of 0 degrees

Figure 10. Smoothing factors and average amplification factors for the 21-point stencil
and the 9-point stencil, κ = 0, for a flow angle of 45 degrees

EULER RESULTS 


Results for the two-dimensional Euler equations are now presented. Two test cases are used in
this study. The first case is the subsonic flow in a channel with a 3% sin² bump. This case was chosen
because the flow is nearly grid aligned in every cell. The channel length is three times the channel
height and the length of the bump is equal to the channel height. A freestream Mach number of 0.3 is
used. The grid used in this study consists of 157 points along the wall and 49 points normal to the
Figure 11. Effect of grid aspect ratio on smoothing factor for flow aligned with grid (horizontal axis: log10(dy/dx))

wall and is shown in figure 12. The density contours for the converged solution using the 21-point
stencil are also shown in figure 12. All of the cases utilize a 3-level V-cycle using 15 subiterations to
solve the linear system at each level. One smoothing iteration is performed on each level except the
coarsest grid, where 3 smoothing steps are performed.

Convergence histories for this case using the 9-point stencil with κ = -1 are shown in figure 14a.
As the CFL increases, the convergence rate improves up to a CFL of about 10, after which the
convergence degrades, eventually becoming unstable. As discussed above, when the CFL is increased, high
frequency error modes approach neutral stability. The analysis, however, assumes the linear system is
solved exactly at each time step, which is generally not the case with only 15 subiterations. Therefore,
the scheme may require a prohibitive number of subiterations to remain stable at high CFL numbers.

The convergence histories for the 9-point stencil with κ = 0 are shown in figure 14b. Unlike the
9-point stencil with κ = -1, this stencil produces very good convergence rates as the CFL is increased.
Note that there is little decrease in the spectral radius after a CFL of 100. This is consistent with the
analysis shown in figure 10. The convergence histories for the 21-point stencil are shown in figure




Figure 12. 3% sin²(x) bump grid and contours

14c and are very similar to the 9-point, κ = 0 stencil. For this test case, in which the flow is aligned
with the grid, both stencils have very good convergence properties.

To examine the behavior of the schemes with higher aspect ratio cells and when the flow is not
aligned with the grid, a second test case is considered: a NACA 0012 airfoil in a Mach = 0.8
freestream at 0 degrees angle of attack. The calculations were performed on a 65 × 25 C-grid, which is
shown in figure 13 along with the converged density contours obtained with the 21-point stencil. All
cases were run using a 3-level V-cycle and 20 subiterations to solve the linear system.



Figure 13. NACA 0012, Mach = 0.8, α = 0°: far-field grid and near-field density contours


The convergence histories for both the 21-point stencil and 9-point stencil with κ = 0 are shown
in figure 14d. Only the κ = 0 value is used because of the poor convergence properties of the κ = -1
stencil. As shown, the 21-point stencil converges significantly faster than the 9-point κ = 0 stencil. In
particular, note that the number of multigrid cycles to reach a residual of 10^-16 using the 21-point
stencil is about the same as for the channel flow. By contrast, the 9-point, κ = 0 stencil shows a
marked deterioration in performance compared to the channel flow case. These results are consistent
with the analysis for flow angularity and cell aspect ratio effect presented above.


DISCUSSION 


The analysis and computations presented indicate that the choice of data reconstruction for
upwind methods can have a substantial effect on the multigrid performance for a given time
advancement scheme. In particular, the popular 9-point, κ = -1 stencil exhibits very poor multigrid
convergence for high CFL numbers. The 9-point, κ = 0 stencil has much better smoothing properties but still
has difficulty damping the high frequency waves if the flow is not aligned with the grid. By using an
interpolation operator based on Green's theorem, excellent smoothing properties are obtained for
high CFL numbers regardless of the flow angularity, as shown in figures 9 and 10. This has been
shown through analysis and confirmed through numerical experiments.

Figure 14. Residual histories: a) sin²(x) bump, 9-point, κ = -1 stencil; b) sin²(x) bump, 9-point, κ = 0 stencil; d) transonic airfoil, 21-point and 9-point, κ = 0 stencils
ABSTRACT


A single grid local mode analysis is used to predict the smoothing properties of numerical schemes 
for solving the Navier-Stokes equations with factorization based on Stone’s Strongly Implicit Method. 
Four difference approximations for the convection terms are considered, namely, hybrid, central, second- 
order upwind, and third-order upwind. Smoothing factors from the analysis are compared with practical 
convergence factors in a multigrid method for flow over a backward facing step and it is found that the 
local mode analysis correctly predicts the effects of Reynolds number and higher-order schemes. 


1 INTRODUCTION 


The successful use of multigrid methods to accelerate convergence depends on the ability of
the numerical algorithm to damp high frequency error components, since these components cannot be
resolved on coarser grids. High frequency components have short coupling ranges; their smoothing is
therefore a localized process, so only one isolated computational stencil need be analyzed
and the effect of boundaries can be neglected. This is the approach of local mode analysis for the
prediction of smoothing properties, which was first introduced by Brandt [1] for various partial differential
equations and numerical algorithms. Shaw and Sivaloganathan [2] extended this analysis to the SIMPLE
pressure correction algorithm using alternating direction implicit (ADI) relaxation for the solution of the
algebraic system of equations for varying Reynolds numbers and under-relaxation factors. Convection
terms were approximated using a hybrid of first-order upwind and second-order central differencing.

The present paper uses local mode analysis to predict the smoothing properties of numerical algo- 
rithms for calculation of two-dimensional recirculating flows using higher-order difference schemes for 
convection terms introduced via deferred correction and Stone’s Strongly Implicit Method for factoriza- 
tion of the resulting system of algebraic equations. Reynolds number and higher-order convection 
approximation effects are addressed and compared to multigrid results for laminar flow over a backward 
facing step. 
2 THEORETICAL ANALYSIS


2.1 Governing equations 


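The equations governing steady, two-dimensional, incompressible flow can be written as:

$$\frac{\partial}{\partial x}(\rho u) + \frac{\partial}{\partial y}(\rho v) = 0 \qquad (1)$$

$$\frac{\partial}{\partial x}(\rho u^2) + \frac{\partial}{\partial y}(\rho u v) = \cdots \qquad (2)$$

$$\frac{\partial}{\partial x}(\rho u v) + \frac{\partial}{\partial y}(\rho v^2) = \cdots \qquad (3)$$

where u and v are the velocity components in the x and y directions, respectively, p is the pressure, μ
is the absolute viscosity, and ρ is the density. Equations (1) - (3) represent the conservation of mass and
momentum in the x and y directions, respectively.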


The solution sequence is a predictor-corrector method which follows the SIMPLE algorithm of Patan- 
kar and Spalding [3]. Factorization of the system of equations is based on Stone’s Strongly Implicit 
Method. The flow geometry and boundary conditions are shown in Figure 1. 

The governing equations are discretized by integrating over a set of three staggered control volumes 
and the locations of variables are shown in Figure 2. The central control volume is for the pressure. Equa- 
tion (2) can be discretized by integrating over the left-shifted control volume for the u component of 
velocity. This leads to: 


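$$
\begin{aligned}
a_P u_P ={}& a_E u_E + a_W u_W + a_N u_N + a_S u_S + a_{EE} u_{EE} + a_{WW} u_{WW} + a_{NN} u_{NN} + a_{SS} u_{SS} \\
& - \frac{p_P - p_W}{h_x} + \frac{\mu_0}{h_x h_y}\,(v_N - v_{NW} + v_W - v_P) + \frac{\mu_0}{h_x^2}\,(u_E - 2u_P + u_W)
\end{aligned}
\qquad (4)
$$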


Equation (3) is discretized by integrating over the bottom-shifted control volume for the v component 
of velocity: 


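$$
\begin{aligned}
a_P v_P ={}& a_E v_E + a_W v_W + a_N v_N + a_S v_S + a_{EE} v_{EE} + a_{WW} v_{WW} + a_{NN} v_{NN} + a_{SS} v_{SS} \\
& - \frac{p_P - p_S}{h_y} + \frac{\mu_0}{h_x h_y}\,(u_E - u_{SE} + u_S - u_P) + \frac{\mu_0}{h_y^2}\,(v_N - 2v_P + v_S)
\end{aligned}
\qquad (5)
$$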


where the coefficients contain convection and diffusion terms, the subscripts of u, v, and p refer to
the location of the variables (see figure 2), $h_x$ and $h_y$ are the grid spacings in the x and y directions,
respectively, and $\mu_0$ is the absolute viscosity of the fluid, assumed constant.


2.2 Approximation of convection terms 


The $a_i$ coefficients in equations (4) and (5) are dependent on the approximation used for the
convection terms. The present analysis investigates the hybrid, central, second-order upwind (2nd OU), and
third-order upwind (3rd OU) approximation schemes.

The coefficients for the hybrid scheme have the following form: 
where the sum for $a_n$ is taken over the a coefficients, $u_0$ and $v_0$ are the frozen velocity components
due to the linearization of equations (2) and (3), the max{a, b} operator selects the maximum of the
arguments a and b, and | | represents the absolute value.


The coefficients for the higher-order schemes have the following general form: 
The values of $f_1$ and $f_2$ depend on the higher-order method to be used. For central differencing, $f_1 = f_2 = 0$;
for 2nd OU differencing, $f_1 = 1/2$ and $f_2 = 1$; and for 3rd OU differencing, $f_1 = 1/8$ and $f_2 = 1/4$.


2.3 Local mode analysis 


The Strongly Implicit Method (SIP) by Stone [4] solves the algebraic equations shown in equations
(4) and (5) in a fully implicit manner. All variables are treated as unknowns, as opposed to line-relaxation
methods, which consider only lines of constant x or y as unknowns while sweeping through the
computational domain. In the local mode analysis outlined below, the variables that are updated at the end of a
relaxation sweep will be denoted by a dot over the variable, such as $\dot{u}_P$, while those from the beginning of
the relaxation sweep or those unchanged by the current relaxation sweep will be written as above, such as
$u_P$. Higher-order approximations for the convection terms are introduced via deferred correction (see
Khosla and Rubin [5]). In this procedure, the $a_i$ coefficients are calculated initially using equations (6) -
(11) for the hybrid scheme. As the solution proceeds, the higher-order scheme is slowly introduced via
corrections to the source terms. When the solution is fully converged, the coefficients are
effectively those of the higher-order scheme outlined in equations (12) - (20). The base hybrid coefficients will
be denoted by an additional subscript h, such as $a_P|_h$, while the higher-order coefficients will not have an
additional subscript and will be written as above, such as $a_P$.

The relaxation of equation (4) (the discretized x momentum equation), with under-relaxation and 
deferred correction can be written as: 

a p\u p = a p\u P + r m (a E \ h u E + a w \ h u w + a N \ h u N + a s \ h u s - a P \ h u P ) 

, ~(.Pp~Pw) f M„ V M a . 
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* x y h x 
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+ a EE ( U EE~ U p) + a Ww( u WW~ u p) + a NN ( U NN~ U p) + a SS ( U SS ~ U p) 1 (21) 

where r m is the under-relaxation factor for the x and y momentum equations, r; is the relaxation factor 
for introducing higher-order coefficients and is set to unity for the analysis. The exact solution for U, V, 
and P also satisfies equation (21). If an equation written with the exact solution for U, V, and P is sub- 
tracted from equation (21), which is written in terms of the approximate solution, u, v, and p, the error in 
the solution can be introduced. The error has components defined as $e^u = U - u$, $e^v = V - v$, and
$e^p = P - p$.

Equation (21) written in terms of the error becomes: 
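Since the continuous governing equations (1) - (3) have been linearized during the discretization, a
single Fourier component of the error can be considered as:

$$
\begin{aligned}
e^u_P &= \alpha^u_\theta \exp\left\{ i\left[\theta_1 \frac{x}{h_x} + \theta_2 \frac{y}{h_y}\right] \right\} \\
e^u_W &= \alpha^u_\theta \exp\left\{ i\left[\theta_1 \frac{x - h_x}{h_x} + \theta_2 \frac{y}{h_y}\right] \right\} \\
e^u_S &= \alpha^u_\theta \exp\left\{ i\left[\theta_1 \frac{x}{h_x} + \theta_2 \frac{y - h_y}{h_y}\right] \right\}
\end{aligned}
\qquad (23)
$$

where $i = \sqrt{-1}$, $\theta_1$ and $\theta_2$ are the components of the phase angle vector $\theta$, and $\alpha_\theta$ is the error
amplitude of the single Fourier mode $(\theta_1, \theta_2)$. Similar expressions exist for other grid points to the east and
north and for the variables v and p. Substituting the single Fourier modes into equation (22) and dividing
through by $\exp\{ i[\theta_1 x/h_x + \theta_2 y/h_y] \}$, equation (22) becomes: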


Wi- 


^l^ a P ] h~ r '»[ a E ] h e ' B ' + a W ] h e ' + a N ] h e 1 + Q s\ e * h 2 

«,!*(! - r m ) + 'VJ (a E -a E \ h ) (e* 1 - 1) + (a w - a w \ h ) (e ‘ e ’ - 1) + (a N - a N \ h ) (e 2 - 1) 
+ (fl s -a s IJ 1) +a EE (e 2 ‘ 6 '- 1) +a ww (e 2 *' - 1) +a NN (e 1 -l)+a ss {e 2 -l)l) 


4r «IW2 v 2 Vl [ o 

Wy “ 9 K 9 


( 24 ) 


where: $s_1 = \sin(\theta_1/2)$

$s_2 = \sin(\theta_2/2)$


Equation (24) can be written in a more compact form, if the following variables are defined: 

( i0 -;e, . ie 2 -.e 2 
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Equation (24) can then be written as: 

de = — (va“ e - <pa v e - CV e) 


( 25 ) 


Following the same procedure for equation (5) (the discretized y momentum equation) yields. 
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(26) 
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where: 
Equations (25) and (26) can be combined to give the amplification matrix $A_1$ for the complete
operation. This yields:


n 

— (p V 


111 


0 2 


<pS“ V 


In compact form:

$$\dot{\alpha}_\theta = A_1\,\alpha_\theta \qquad (27)$$

Equation (27) yields the amplification matrix defining how the amplitude of the Fourier mode, with
phase angles $\theta_1$ and $\theta_2$, is amplified during relaxation of the x and y momentum equations.

The SIMPLE pressure correction of Patankar and Spalding [3] follows the relaxation of the x and y 
momentum equations. A single dot over a variable will denote a value at the completion of the relaxation 
of the x and y momentum equations. A double dot over a variable will denote a value at the completion of 
the pressure correction. The variables u, v, and p are corrected following Shaw and Sivaloganathan [2]: 


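$$\ddot{u}_P = \dot{u}_P - \frac{r_{uv}}{a_P h_x}\,(\delta p_P - \delta p_W) \qquad (28)$$

$$\ddot{v}_P = \dot{v}_P - \frac{r_{uv}}{a_P h_y}\,(\delta p_P - \delta p_S) \qquad (29)$$

$$\ddot{p}_P = p_P + r_p\,\delta p_P \qquad (30)$$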


where $r_{uv}$ is the relaxation factor for correcting the u and v velocities and $r_p$ is the relaxation factor for
updating pressure. The value $\delta p$ is a pressure increment such that the velocity field u and v will satisfy
conservation of mass. It is obtained by discretizing equation (1) and substituting for the velocities and
corrections given in equations (28) - (30). This yields an equation for the pressure correction:


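$$a^p_P\,\delta p_P = a^p_N\,\delta p_N + a^p_S\,\delta p_S + a^p_E\,\delta p_E + a^p_W\,\delta p_W - \frac{1}{h_x}\,(\dot{u}_E - \dot{u}_P) - \frac{1}{h_y}\,(\dot{v}_N - \dot{v}_P) \qquad (31)$$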
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where: cf N = 


1 


E ~ 



1 



= c? w 


$a^p_P = a^p_N + a^p_S + a^p_E + a^p_W$

Equations (28) - (31) can be written in compact form as: 


r uv s x s 

Up - Up -o h op h 

Op 

Op 

Ph = Ph + r p 5 ph 
P h 5 Ph - SVa + ^Va 

If equation (35) is solved for $\delta p_h$ and used in equations (32)-(34):


( 32 ) 

( 33 ) 

( 34 ) 

( 35 ) 


■u h = u h - r ^&\p-\(V h u h + V k v h ) ( 36 ) 

o P 

v* = v*-^V , A ( 8 > a + SV a ) (37) 

a P 

Ph = Ph + r P F ~ l h(^ h u h + ^ h v h ) ( 38 ) 

This assumes that the pressure correction equation has been solved exactly. As before, the error is 
introduced by writing equations (36)-(38) using the exact solution and then subtracting the result from 
equations (36)-(38) respectively. The errors become: 



( 39 ) 

a p 

h - r -^V h p-\(8\i? h + V h E h ) 

( 40 ) 

a P 

^ h + r p p '* ( 8 \&* + a ) 

( 41 ) 


The Fourier components of the error can be substituted as before to give the $A_2$ amplification matrix,
which governs the amplification of errors during the pressure correction phase. In compact form:

$$\ddot{\alpha}_\theta = A_2\,\dot{\alpha}_\theta \qquad (42)$$

where: $P_h = a^p_E e^{i\theta_1} + a^p_W e^{-i\theta_1} + a^p_N e^{i\theta_2} + a^p_S e^{-i\theta_2} - a^p_P$

Equation (42) gives the amplification matrix defining how the amplitude of the Fourier mode with
phase angles $\theta_1$ and $\theta_2$ is amplified during the pressure correction phase of the algorithm. The
amplification matrix for relaxation of the x and y momentum equations and the pressure correction is obtained by
combining equations (27) and (42) as:

$$\ddot{\alpha}_\theta = A_2 A_1\,\alpha_\theta = A\,\alpha_\theta \qquad (43)$$

2.4 Smoothing factor 


The smoothing factor is a measure of the worst reduction of the high frequency error components for
one complete relaxation sweep. It is calculated as the largest eigenvalue magnitude of the amplification matrix A
given by equation (43) for the Fourier modes $\theta_1$ and $\theta_2$ in the high frequency range defined as:
$\pi/2 \le |\theta_1| \le \pi$ and $\pi/2 \le |\theta_2| \le \pi$.
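
The definition of the smoothing factor can be sketched numerically as a scan over the high frequency range. This is a minimal illustration, not the authors' code. The entries of A = A_2 A_1 are not reproduced here, so the callable `gauss_seidel_poisson` (the textbook amplification factor of lexicographic Gauss-Seidel for the 5-point Laplacian) is a stand-in chosen only to make the example runnable; the function names and the sampling density `n` are likewise assumptions.

```python
import numpy as np


def smoothing_factor(amplification_matrix, n=33):
    """Largest eigenvalue magnitude over pi/2 <= |theta_1|, |theta_2| <= pi.

    amplification_matrix(theta_1, theta_2) must return a square (possibly
    complex) array, or a scalar for a single-variable problem.
    """
    high = np.linspace(np.pi / 2, np.pi, n)
    thetas = np.concatenate((-high[::-1], high))  # both signs of each phase angle
    worst = 0.0
    for t1 in thetas:
        for t2 in thetas:
            A = np.atleast_2d(amplification_matrix(t1, t2))
            worst = max(worst, np.max(np.abs(np.linalg.eigvals(A))))
    return worst


def gauss_seidel_poisson(t1, t2):
    """Stand-in symbol (assumption): lexicographic Gauss-Seidel, 5-point Laplacian."""
    return (np.exp(1j * t1) + np.exp(1j * t2)) / (4.0 - np.exp(-1j * t1) - np.exp(-1j * t2))


mu = smoothing_factor(gauss_seidel_poisson)
```

With the range restricted as above (both phase angles in the high frequency band), the stand-in attains its maximum at $\theta_1 = \theta_2 = \pi/2$, giving $\mu = 1/\sqrt{5} \approx 0.447$. Supplying a callable that returns the 3 x 3 matrix A of equation (43) in place of the stand-in would reproduce the procedure described in this section.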


3 RESULTS 


To test the predictive capability of the local mode analysis (LMA) presented in Section 2, flow over a backward facing step
(BFS) was computed using a multigrid code based on the FAS-FMG (full approximation storage - full
multigrid) algorithm proposed by Brandt [1]. Higher-order schemes were introduced through deferred
correction only on the finest of three grids with constant grid spacing in the x and y directions. The grid
sizes from coarsest to finest grid are $n_x \times n_y$ = 66 × 18, 130 × 34, and 258 × 66, where $n_x$ is the number of
grid points in the x direction and $n_y$ is the number of grid points in the y direction. Smoothing properties
on the two coarser grids are identical for the four schemes since hybrid coefficients were used on these
grids. Local mode analysis was used to estimate the smoothing factor on the finest grid and this result was
compared to the number of work units to reach convergence for the multigrid result. The work units
(WU) and convergence factor (CF) are indicators of the smoothing properties of the algorithm and
numerical scheme. The work units for a two-dimensional problem with grid refinement in the x and y
directions are defined as:


$$WU = \sum_{i=1}^{N} x_i \, 2^{2(i - N)} \qquad (44)$$

where $x_i$ is the number of iterations on the $i$th grid at convergence, $i = 1$ for the coarsest grid, and $i = N$
for the finest grid. The convergence factor is defined as:

$$CF = \left( \frac{r_f}{r_i} \right)^{1/\Delta WU} \qquad (45)$$

where $r_i$ is the initial norm of the residuals of the x momentum, y momentum, and pressure correction
equation on the fine grid, $r_f$ is the norm of the residuals at convergence on the fine grid, and $\Delta WU$ is the
change in work units on the fine grid.
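
As a small worked illustration of the two definitions, the sketch below evaluates the work units as the sum of per-grid iteration counts weighted by 2^(2(i-N)) and the convergence factor as the residual ratio raised to the power 1/ΔWU. The function names and the iteration counts in the example are hypothetical and are not taken from the computations reported here.

```python
def work_units(iterations_per_grid):
    """Work units: iterations_per_grid[0] is the coarsest grid, [-1] the finest.

    Each coarsening in two dimensions divides the cost of one iteration by 4,
    which is the factor 2**(2 * (i - N)) for grid i of N.
    """
    n_grids = len(iterations_per_grid)
    return sum(
        x * 2.0 ** (2 * (i - n_grids))
        for i, x in enumerate(iterations_per_grid, start=1)
    )


def convergence_factor(initial_residual, final_residual, delta_work_units):
    """Average residual reduction per work unit."""
    return (final_residual / initial_residual) ** (1.0 / delta_work_units)


# Hypothetical three-grid run: 16 coarse, 8 medium, and 4 fine-grid iterations.
wu = work_units([16, 8, 4])            # 16/16 + 8/4 + 4 = 7.0
cf = convergence_factor(1.0, 0.25, 2)  # residual quartered over 2 WU gives 0.5
```

A convergence factor close to 1 means that each work unit removes little of the residual, which is why the values near 0.99 in Table 1 correspond to several hundred work units.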

The BFS flow was solved for Reynolds numbers (based on the upstream channel height) of 100, 250,
and 400 using the hybrid, central, 2nd OU, and 3rd OU schemes. The work units for convergence and
convergence factors are shown in Table 1. The smoothing factors were calculated for conditions identical
to the finest grid of the multigrid results (the same grid spacing, relaxation factors, density, etc.). The
only parameters varied were the frozen velocity components needed to calculate the coefficients in
equations (6) - (20) of the various schemes. The maximum velocity components provide an upper bound for
the smoothing factor, which will dominate the smoothing properties, since it was found that as the
cell-Reynolds number (Reynolds number with the length scale based on the grid spacing) approaches zero,
corresponding to regions where the velocity approaches zero, the smoothing factor also decreases. An
estimate of the maximum velocity components for the BFS flow is $u_0$ = 1.5 at y = 0.25 at the inlet, and $v_0$
= 0.15 near recirculation regions. Three principal flow directions are considered with velocity
components given by: $(u_0, v_0)$ = (0, 0.15), (1.5, 0.15), and (1.5, 0). The SIP method exhibits symmetry about the x
and y axes so that other flow direction results can be obtained from the three principal flow direction
results. For example, the smoothing factors for $(u_0, v_0)$ = (-1.5, 0.15), (-1.5, -0.15), and (1.5, -0.15) are
equal to the smoothing factor for $(u_0, v_0)$ = (1.5, 0.15). The smoothing factor is then defined as the largest
eigenvalue of the amplification matrix A, defined by equation (43), for the three flow directions while
restricting the phase angles to the high frequency range. Results from the three principal flow directions
show that the flow direction $(u_0, v_0)$ = (1.5, 0.15) produced the largest eigenvalue for all Reynolds
numbers, and thus the smoothing factor was based on this flow direction. The computed smoothing factors are
shown in Table 2.

The results of Table 2 show that as the cell-Reynolds number in the LMA is increased, the smoothing
factor also increases for the four schemes. More work units will be required to smooth the high frequency
error components. The results in Table 1 show that the work units increase and the convergence factor
deteriorates as the Reynolds number increases. For the Re = 100 results, the LMA predicts that the
smoothing properties of the hybrid, central, and 3rd OU schemes will be virtually identical while that of the 2nd
OU scheme will be slightly worse. The multigrid results confirm this prediction. For the Re = 250 and 400 results,
the LMA predicts that the hybrid difference scheme will have the best smoothing properties while the
central difference scheme will have the worst, and the 2nd OU and 3rd OU difference schemes should be
similar, with the 3rd OU difference scheme slightly better. The multigrid results confirm these predictions,
with the exception that the 2nd OU difference scheme converged in slightly fewer
work units than the 3rd OU difference scheme. Their convergence factors are similar.


Table I: Work Units/Convergence Factors of Multigrid Results

Difference Scheme    Re = 100     Re = 250     Re = 400
Hybrid               59/0.868     137/0.930    321/0.981
Central              58/0.873     166/0.964    569/0.993
2nd OU               63/0.891     148/0.952    421/0.988
3rd OU               58/0.875     152/0.956    428/0.988
Table II: Smoothing Factors from Local Mode Analysis

Difference Scheme    Re = 100     Re = 250     Re = 400
Hybrid               0.902        0.924        0.931
Central              0.910        0.950        0.968
2nd OU               0.931        0.946        0.950
3rd OU               0.899        0.926        0.935
Re_Δx/Re_Δy          11.72/0.47   29.30/1.18   46.88/1.88
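
The comparison between the two tables can be checked mechanically by ranking the schemes at each Reynolds number. The sketch below uses only the tabulated values; the dictionary layout and the helper name `ranking` are illustrative assumptions.

```python
REYNOLDS = (100, 250, 400)

# Work units to convergence (Table I) and LMA smoothing factors (Table II).
work_units = {
    "Hybrid": (59, 137, 321),
    "Central": (58, 166, 569),
    "2nd OU": (63, 148, 421),
    "3rd OU": (58, 152, 428),
}
smoothing = {
    "Hybrid": (0.902, 0.924, 0.931),
    "Central": (0.910, 0.950, 0.968),
    "2nd OU": (0.931, 0.946, 0.950),
    "3rd OU": (0.899, 0.926, 0.935),
}


def ranking(table, column):
    """Scheme names ordered from best (smallest value) to worst."""
    return sorted(table, key=lambda scheme: table[scheme][column])


for column, re in enumerate(REYNOLDS):
    print(f"Re = {re}")
    print("  LMA smoothing factor:", ranking(smoothing, column))
    print("  multigrid work units:", ranking(work_units, column))
```

For Re = 250 and Re = 400 both rankings place the hybrid scheme first and the central scheme last, and they differ only in the order of the two upwind schemes, which is the exception noted in the discussion of the results.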


CONCLUSION 


Local mode analysis was performed using four schemes for the approximation of convection terms: 
hybrid, central, second-order upwind, and third-order upwind, over a range of cell-Reynolds numbers. 
The smoothing factors from this analysis were compared with actual multigrid results for flow over a 
backward facing step to test the predictive capability of local mode analysis. It was found that this analy- 
sis is useful in predicting the smoothing properties of the four schemes along with the effect of flow Rey- 
nolds number. This analysis could be extended to predict optimum relaxation factors, grid aspect ratios, 
and other solution algorithms. 
Figure 1. Geometry and boundary conditions for flow over a backward facing step.

Figure 2. Location of variables for staggered grid.

SUMMARY


The performance of a linear multigrid method using four smoothing methods, called SCGS, CLGS, 
SILU and CILU, is investigated for the incompressible Navier-Stokes equations in general coordinates, 
in association with Galerkin coarse grid approximation. Robustness and efficiency are measured and 
compared by application to test problems. The numerical results show that CILU is the most robust, 
SILU the least, with CLGS and SCGS in between. CLGS is the best in efficiency, SCGS and CILU 
follow, and SILU is the worst. 


INTRODUCTION 


Robustness and efficiency of a multigrid method are strongly influenced by the smoother used.
Because there are so many factors influencing robustness and efficiency, it is hard to say in general
which method is the most appropriate choice for certain applications. In this paper, we study four
smoothing methods for the incompressible Navier-Stokes equations in general coordinates, namely
SCGS (Symmetrical Coupled Gauss-Seidel [18]), CLGS (Collective Line Gauss-Seidel, adapted from
SCAL [16]), SILU (Scalar ILU, or TILU in [23]) and CILU (Collective ILU [30]), which
are used in a linear multigrid method. Galerkin coarse grid approximation (GCA) is used. An
elementary introduction to GCA can be found in [21]. Application to the Navier-Stokes equations is
discussed in [29] and [31].

The multigrid method using the above four smoothers solves the velocity and the pressure
simultaneously (collectively). Decoupled solution is also used in practice, solving the velocity and the
pressure separately. A comparison is given in [1] of multigrid methods using coupled solution with
SCGS and CLSOR (Coupled Line Successive Over-Relaxation) smoothing and multigrid methods using
decoupled solution. Comparisons are presented in [13] and [14] for multigrid methods using the SCGS
method and methods using the uncoupled MGPC method (Multigrid Pressure Correction) and the SPC
(SIMPLE Pressure Correction) smoothing methods by means of local Fourier analysis as well as
numerical experiments. It is stated in [17] that it is advantageous to use the coupled approach.
However, both coupled and decoupled solution methods are widely used in practice.

The comparisons mentioned above are made for nonlinear multigrid methods in which the coarse
grid operators are computed by using discretization coarse grid approximation (DCA). The relative
merits of DCA and GCA are discussed in [21]. A nonlinear multigrid using DCA for the applications
discussed in the present paper is presented in [8], [9], [10]. Here we apply GCA. Our main reason is that
discretization of the Navier-Stokes equations in general coordinates on a staggered grid is a complicated
affair, and GCA enables us to completely separate discretization and multigrid solution. In this paper,
attention is focused on smoothing.


EQUATIONS AND DISCRETIZATION 


The incompressible Navier-Stokes equations in tensor formulation in general curvilinear coordinates 
are given as follows: 

dU a 

-gf + U»U% + 9° B vr Re-' (/’CTJ + y = B a , (1) 

u% = 0 . ( 2 ) 

Here $U^\alpha$, $\alpha = 1, 2, \ldots, d$, are the contravariant velocity components with d the number of space
dimensions, p is the pressure, t is the time, $B^\alpha$ is the contravariant component of the body force, and $g^{\alpha\beta}$
is the metric tensor. About tensor notation, see [2] for more details. $U^\alpha_{,\beta}$ is the covariant derivative.
Readers not familiar with tensor analysis can understand what is going on by assuming that Cartesian
coordinates are used, and interpreting $U^\alpha_{,\beta}$ as $\partial U^\alpha / \partial x^\beta$. $U^\alpha$ and $B^\alpha$ are defined by $U^\alpha = \mathbf{a}^\alpha \cdot \mathbf{u}$,
$B^\alpha = \mathbf{a}^\alpha \cdot \mathbf{b}$, where $\mathbf{u}$ and $\mathbf{b}$ are the physical velocity vector and the body force, respectively, and $\mathbf{a}^\alpha$ is
the contravariant base vector of the general coordinate system. Let $x = (x^1, x^2, \ldots, x^d)$ be a Cartesian
coordinate system and $\xi = (\xi^1, \xi^2, \ldots, \xi^d)$ be a general coordinate system. Then the contravariant base
vector $\mathbf{a}^\alpha$ is defined as $\mathbf{a}^\alpha = \mathrm{grad}(\xi^\alpha)$, and the metric tensor $g^{\alpha\beta}$ is defined by $g^{\alpha\beta} = \mathbf{a}^\alpha \cdot \mathbf{a}^\beta$. It is found
that to achieve better accuracy the variable $V^\alpha = \sqrt{g}\,U^\alpha$ should be used instead of $U^\alpha$ ([7], [12], [22]),
with $\sqrt{g}$ the Jacobian of the transformation $x \to \xi$: $\sqrt{g} = |\partial x^\alpha / \partial \xi^\beta|$.


A finite volume discretization of equations (1) and (2) on staggered grids in general coordinates is
presented in [7], [12], [22]. From now on we concentrate on two dimensions. Cells may be indexed by
a two-tuple of integers $i = (i_1, i_2) \in G$, $G = \{1, 2, \ldots, I\} \times \{1, 2, \ldots, J\}$, with I and J the number of
cells in the $\xi^1$- and the $\xi^2$-direction. The index system for discrete variables is defined as follows. The
$V^1$ variable at the center of the left face, the $V^2$ variable at the center of the lower face and the p
variable at the center of a cell have the same index as the cell. Cells can be numbered in many ways,
but unless indicated otherwise, we use the lexicographic order. Variables can also be numbered in
different ways, for example, blockwise ordering. We use blockwise ordering for representation of
equations; orderings used in the smoothers may be different and are specified together with the
smoothers. In blockwise ordering, $V^1$, $V^2$ and p are ordered separately:

(” ’ " » Vk * ' ‘ ' ) Vk > ^fc+u ’ ' ' iPk,Pk+ 1 » • • Let V = (V 1 , V 2 ), B = (J5 1 , B 2 ) and p represent the 

discrete velocity, the body force and the pressure grid functions, respectively. The discretization results 
in the following discrete system: 

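$$\frac{1}{\Delta t} V^{n+1} + \theta Q'(V^{n+1}) + \theta G p^{n+1} = F'^{(n+1)}, \qquad D V^{n+1} = f^{c(n+1)}, \qquad (3)$$

with

$$F'^{(n+1)} = \theta B^{n+1} + (1 - \theta) B^n + \frac{1}{\Delta t} V^n - (1 - \theta) Q'(V^n) - (1 - \theta) G p^n, \qquad f^{c(n+1)} = 0. \qquad (4)$$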

The superscript $n$ denotes the time level. The parameter $\theta$ is in $[0, 1]$. The backward Euler method is
obtained by setting $\theta = 1$, which is the method used in our numerical experiments. Note that (3) is
nonlinear, and is to be solved by a linear multigrid method. Therefore it should be linearized outside
multigrid iterations at each time level. Linearization with Newton's method gives

$$(U^\alpha U^\beta)^{n+1} = (U^\alpha)^{n+1} (U^\beta)^n + (U^\alpha)^n (U^\beta)^{n+1} - (U^\alpha U^\beta)^n. \qquad (5)$$
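
The origin of (5) can be made explicit with a short worked step. This is a sketch added for clarity; the increment symbol $\delta U^\alpha$ is notation introduced here for illustration and does not appear in the paper. Writing $(U^\alpha)^{n+1} = (U^\alpha)^n + \delta U^\alpha$ and dropping the term that is quadratic in the increment gives:

```latex
\begin{aligned}
(U^\alpha U^\beta)^{n+1}
  &= (U^\alpha U^\beta)^n + \delta U^\alpha (U^\beta)^n + (U^\alpha)^n \delta U^\beta
     + \delta U^\alpha \delta U^\beta \\
  &\approx (U^\alpha U^\beta)^n + \delta U^\alpha (U^\beta)^n + (U^\alpha)^n \delta U^\beta \\
  &= (U^\alpha)^{n+1} (U^\beta)^n + (U^\alpha)^n (U^\beta)^{n+1} - (U^\alpha U^\beta)^n .
\end{aligned}
```

The neglected term $\delta U^\alpha \delta U^\beta$ is of second order in the change between time levels, so the linearization error becomes small as a steady state is approached.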


So we have

$$Q'(V^{n+1}) = Q_1 V^{n+1} + Q_2(V^n) \qquad (6)$$

with both $Q_1$ and $Q_2$ evaluated at time level $n$. Note that $Q_1$ is linear. As a consequence, using
blockwise ordering, the linear system to be solved at each time level can be written as


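$$Kx = f, \qquad (7)$$

where

$$K = \begin{pmatrix} Q & \theta G \\ D & 0 \end{pmatrix}, \quad x = \begin{pmatrix} V^{n+1} \\ p^{n+1} \end{pmatrix}, \quad f = \begin{pmatrix} F^{(n+1)} \\ f^{c(n+1)} \end{pmatrix}, \qquad (8)$$

with

$$Q = \frac{1}{\Delta t} I + \theta Q_1, \qquad F^{(n+1)} = F'^{(n+1)} - \theta Q_2(V^n). \qquad (9)$$

A stationary solution is reached if

$$K_s x = f_s \qquad (10)$$

is satisfied, where

$$K_s = \begin{pmatrix} Q' & G \\ D & 0 \end{pmatrix}, \quad f_s = \begin{pmatrix} B \\ f^c \end{pmatrix}. \qquad (11)$$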


THE SMOOTHING METHODS 


In this section, the four smoothing methods to be used, SCGS, CLGS, SILU and CILU, are 
described briefly. SCGS is of collective point Gauss-Seidel type. It is a well-known fact that
Gauss-Seidel smoothing is not robust when cells in physical space are stretched, which occurs often in
general boundary fitted coordinates. Line smoothers are better than point smoothers in handling such 
problems. Based on the idea of SCGS, a line version called SCAL is presented in [16]. Successful 
applications of the SCGS and the SCAL methods to problems in Cartesian coordinates can be found in, 
for instance, [4], [15], [16], [18]. Satisfactory results are also reported for problems in general 
coordinates ([8], [9], [10], [11]). The results show that SCAL seems to be more attractive than SCGS. 
Good smoothers may also be derived by employing ILU factorization. For a survey of ILU smoothers, 
see [20]. Two versions of ILU smoothers, called SILU and CILU, for the incompressible Navier-Stokes 
equations are presented in [23] and [30], respectively. 
The SCGS Method


The SCGS method updates variables cell by cell in a smoothing sweep, first in lexicographical order
and then in backward lexicographical order. The five variables, say for cell $i \in \mathcal{G}$,
$V^1_i$, $V^1_{i+e_1}$, $V^2_i$, $V^2_{i+e_2}$, $p_i$, which are located at the centers of the four cell faces and the center of
cell $i$, are updated simultaneously, with $e_1 = (1, 0)$, $e_2 = (0, 1)$. For convenience, we introduce
$e_0 = (0, 0)$. Let the array $y$ contain the above five variables, and let the local system for the
correction $\delta y$ of $y$ be given by

$$A\, \delta y = c, \qquad (12)$$

with $c^T = (c^{v1}_i, c^{v1}_{i+e_1}, c^{v2}_i, c^{v2}_{i+e_2}, c^c_i)$ and $A$ a $5 \times 5$ matrix. The local system is formulated as follows.
Equation (8) can be written, in more detail, as

$$K = \begin{pmatrix} Q^{11} & Q^{12} & \theta G^1 \\ Q^{21} & Q^{22} & \theta G^2 \\ D^1 & D^2 & 0 \end{pmatrix}, \quad x = \begin{pmatrix} V^1 \\ V^2 \\ p \end{pmatrix}, \quad f = \begin{pmatrix} F^1 \\ F^2 \\ f^c \end{pmatrix}. \qquad (13)$$

$c$ contains the residuals of the five equations corresponding to the five variables and is computed by

$$\begin{aligned}
c^{v1}_i &= (F^1 - Q^{11} V^1 - Q^{12} V^2 - \theta G^1 p)_i, & c^{v1}_{i+e_1} &= (F^1 - Q^{11} V^1 - Q^{12} V^2 - \theta G^1 p)_{i+e_1}, \\
c^{v2}_i &= (F^2 - Q^{21} V^1 - Q^{22} V^2 - \theta G^2 p)_i, & c^{v2}_{i+e_2} &= (F^2 - Q^{21} V^1 - Q^{22} V^2 - \theta G^2 p)_{i+e_2}, \\
c^c_i &= (f^c - D^1 V^1 - D^2 V^2)_i. &&
\end{aligned} \qquad (14)$$

Using stencil notation ([21]), $A$ can be written as

$$A = \begin{pmatrix}
Q^{11}(i, e_0) & 0 & 0 & 0 & \theta G^1(i, e_0) \\
0 & Q^{11}(i + e_1, e_0) & 0 & 0 & \theta G^1(i + e_1, -e_1) \\
0 & 0 & Q^{22}(i, e_0) & 0 & \theta G^2(i, e_0) \\
0 & 0 & 0 & Q^{22}(i + e_2, e_0) & \theta G^2(i + e_2, -e_2) \\
D^1(i, e_0) & D^1(i, e_1) & D^2(i, e_0) & D^2(i, e_2) & 0
\end{pmatrix}. \qquad (15)$$

Equation (12), with $A$ given by (15), is solved analytically. The correction $\delta y$ is added immediately to $y$:

$$y := y + \omega\, \delta y, \qquad (16)$$

where $\omega$ is an underrelaxation factor.
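
Because the matrix in (15) has a diagonal velocity block bordered by one pressure column and one continuity row, the local system can be solved in closed form by eliminating the four velocity corrections. The following Python sketch illustrates this elimination and the update (16). The function names and the compact inputs (`q` for the four diagonal entries, `g` for the four pressure-column entries, `d` for the four continuity-row entries) are assumptions made for illustration, not the authors' implementation.

```python
def scgs_local_solve(q, g, d, c):
    """Solve the 5x5 local system A*dy = c of SCGS-type (hypothetical helper).

    A is assumed to have the arrow structure
        rows 0..3:  q[k]*dv[k] + g[k]*dp = c[k]
        row  4:     sum_k d[k]*dv[k]     = c[4]
    Returns [dv0, dv1, dv2, dv3, dp].
    """
    # Schur complement obtained by eliminating the velocity corrections
    s = sum(dk * gk / qk for dk, gk, qk in zip(d, g, q))
    if s == 0.0:
        raise ZeroDivisionError("local SCGS system is singular")
    # Pressure correction from the continuity row (zip stops after 4 entries)
    dp = (sum(dk * ck / qk for dk, ck, qk in zip(d, c, q)) - c[4]) / s
    # Back-substitution for the four velocity corrections
    dv = [(ck - gk * dp) / qk for ck, gk, qk in zip(c, g, q)]
    return dv + [dp]


def scgs_update(y, dy, omega):
    """Apply y := y + omega*dy with underrelaxation factor omega."""
    return [yk + omega * dk for yk, dk in zip(y, dy)]


# Example with made-up coefficients (not taken from the paper)
q = [4.0, 5.0, 6.0, 7.0]
g = [-1.0, 1.0, -2.0, 2.0]
d = [-1.0, 1.0, -1.0, 1.0]
c = [1.0, 2.0, 3.0, 4.0, 0.5]
dy = scgs_local_solve(q, g, d, c)
y_new = scgs_update([0.0] * 5, dy, 0.8)
```

In a full smoother this solve would be repeated for every cell of the sweep, with the residuals in `c` recomputed from the latest values.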


The CLGS Method 


The CLGS method is in fact the same as the SCAL method proposed in [16], except that a
smoothing sweep is composed of line Gauss-Seidel in CLGS instead of alternating zebra in SCAL. So
CLGS updates variables line by line successively. Let the vector $y$ accommodate the variables for a
whole horizontal $i_2$-line of cells:

$$y^T = (\ldots, V^1_i, V^2_i, V^2_{i+e_2}, p_i, V^1_{i+e_1}, V^2_{i+e_1}, V^2_{i+e_1+e_2}, p_{i+e_1}, V^1_{i+2e_1}, V^2_{i+2e_1}, V^2_{i+2e_1+e_2}, p_{i+2e_1}, \ldots), \quad i = (i_1, i_2) \in \mathcal{G}. \qquad (17)$$

Updating $y$ gives horizontal line Gauss-Seidel smoothing. Similarly, if the line is taken vertical for a
fixed $i_1$, we have the following arrangement of variables:

$$y^T = (\ldots, V^2_i, V^1_i, V^1_{i+e_1}, p_i, V^2_{i+e_2}, V^1_{i+e_2}, V^1_{i+e_2+e_1}, p_{i+e_2}, V^2_{i+2e_2}, V^1_{i+2e_2}, V^1_{i+2e_2+e_1}, p_{i+2e_2}, \ldots), \quad i = (i_1, i_2) \in \mathcal{G}, \qquad (18)$$

which gives vertical line Gauss-Seidel smoothing. We will use forward horizontal line smoothing, unless
indicated otherwise. Other types of line smoothing can be constructed easily by changing the sequence
of visiting lines. Let the equation system for the correction $\delta y$ of $y$ be denoted again by equation (12),
which is readily derived from equation (7), with $c$ the residuals of the equations corresponding to the
variables in $y$. For a line, for example with $i_2$ fixed, the variables having the same $i_1$ are grouped
together to form a 4-vector $(V^1_i, V^2_i, V^2_{i+e_2}, p_i)$. This collective arrangement of variables results in a
$4 \times 4$ block matrix representation of the matrix $A$, which has non-zero elements ($4 \times 4$ matrices) at
positions $(i_1, i_1 \pm 1, i_1 - 2)$ in the $i_1$-th row of $A$. Solution of equation (12) can be carried out easily by
using block LU factorization, which needs no further discussion. Updating is performed by (16).
Because variables are collectively updated and line Gauss-Seidel relaxation is employed, this method is
called CLGS.


The SILU Method 


The SILU method is constructed as follows. Because $K$ in (7) is indefinite, it is hard to find a
regular splitting ([19])

$$K = M - N \qquad (19)$$

such that the classical iteration

$$x^{i+1} = x^i - M^{-1}(K x^i - b) \qquad (20)$$

converges. Therefore, an r-transformation $\bar{K}$ is used ([23], [24], [25], [26]), and a regular splitting

$$K\bar{K} = M - N \qquad (21)$$

is easier to find. Equation (21) corresponds to the following splitting of $K$:

$$K = M\bar{K}^{-1} - N\bar{K}^{-1}. \qquad (22)$$

So with underrelaxation, the iteration (20) is revised as

$$x^{i+1} = x^i - \omega \bar{K} M^{-1}(K x^i - b). \qquad (23)$$
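
The step from (20) to (23) can be spelled out as follows. This is a sketch added for clarity; $\bar{K}$ denotes the r-transformation, and the underrelaxation factor $\omega$ is inserted by hand in the last line. In (22) the matrix $M\bar{K}^{-1}$ plays the role that $M$ plays in (19), so it replaces $M$ in (20):

```latex
\begin{aligned}
x^{i+1} &= x^i - (M\bar{K}^{-1})^{-1} (K x^i - b) \\
        &= x^i - \bar{K} M^{-1} (K x^i - b), \\
x^{i+1} &= x^i - \omega\, \bar{K} M^{-1} (K x^i - b) \quad \text{(with underrelaxation)} .
\end{aligned}
```

Only products with $\bar{K}$ and solves with $M$ are needed, so $\bar{K}^{-1}$ never has to be formed.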

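The matrix $\bar{K}$ chosen and the product $K\bar{K}$ are given by

$$\bar{K} = \begin{pmatrix} I & -Q^{-1} G E^{-1} F \\ 0 & E^{-1} F \end{pmatrix}, \quad K\bar{K} = \begin{pmatrix} Q & 0 \\ D & -F \end{pmatrix} \qquad (24)$$

with $E = DQ^{-1}G$ and $F = DG$. Since $\bar{K}$ involves the computation of $Q^{-1}$ and $E^{-1}$, which is not
practical, the following approximation $\tilde{K}$ of $\bar{K}$ is applied:

$$\tilde{K} = \begin{pmatrix} I & -G \\ 0 & \tilde{F}^{-1} D \tilde{Q}^{-1} G \end{pmatrix}, \qquad (25)$$

where $\tilde{F} = \mathrm{diag}(DG)$ and $\tilde{Q} = \mathrm{diag}(Q)$. Hence we use:

$$x^{i+1} = x^i - \omega \tilde{K} M^{-1}(K x^i - b). \qquad (26)$$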

Note that the approximation K of K is different from that used in [23], which is 

*-(;£)• < 27 > 
because we found with (27) the multigrid method does not work. The ILU factorization of $K\tilde{K}$ uses a
nine-point ILU factorization, in which the ordering of variables is nested, that is,
$(\ldots, V^1_k, V^2_k, p_k, V^1_{k+1}, V^2_{k+1}, p_{k+1}, \ldots)$. So equation (21), applied to $K\tilde{K}$, can be rewritten as

$$K\tilde{K} = (L + D) D^{-1} (D + U) - N \qquad (28)$$

with $L$ and $U$ strictly lower and upper triangular matrices and $D$ a diagonal matrix. For the $k$-th row of
the matrix $M$, the non-zero pattern $G$ for incomplete factorization is chosen to be a nine-point pattern
$G = \{k, k \pm 3, k \pm 3I, k \pm 3I \pm 3, k \pm 3I \mp 3\}$ with $I$ the number of cells in the $\xi^1$-direction, and the
elements of $M$ in $G$ are chosen to be equal to the corresponding elements of $K\tilde{K}$. In this paper, this
method is referred to as SILU because it works with scalar elements of matrices and to distinguish it
from CILU, which works with block elements (here $3 \times 3$ matrices) and is explained now.


The CILU Method 


CILU differs from SILU in two aspects: the choice of r-transformation and a collective treatment of
unknowns. The r-transformation $\bar{K}$ and the corresponding $K\bar{K}$ are given by

$$\bar{K} = \begin{pmatrix} I & -Q^{-1} G \\ 0 & \zeta I \end{pmatrix}, \quad K\bar{K} = \begin{pmatrix} Q & (\zeta - 1) G \\ D & -D Q^{-1} G \end{pmatrix}. \qquad (29)$$

Note that a parameter $\zeta$ is introduced. It is observed that $\zeta$ sometimes has a significant effect on
convergence (cf. [30]), but here for simplicity it is fixed at 2, which is found to be a good compromise
for different problems. Obviously, $\bar{K}$ and $K\bar{K}$ both should be approximated since the computation of
$Q^{-1}$ is impracticable. They are approximated by

$$\tilde{K} = \begin{pmatrix} I & -\tilde{Q}^{-1} G \\ 0 & \zeta I \end{pmatrix}, \quad \widetilde{K\bar{K}} = \begin{pmatrix} Q & (\zeta - 1) G \\ D & -D \tilde{Q}^{-1} G \end{pmatrix}, \qquad (30)$$

respectively. $\widetilde{K\bar{K}}$ is approximately factorized as follows:

$$\widetilde{K\bar{K}} = M - N = (L + D) D^{-1} (D + U) - N. \qquad (31)$$

Similar to CLGS, variables are grouped together. For cell $i$, three variables having the same cell index
are grouped in a 3-vector $(V^1_i, V^2_i, p_i)$. Of course, this corresponds to nested ordering. This collective
treatment of variables leads to a $3 \times 3$ block matrix representation of $\widetilde{K\bar{K}}$. The ILU factorization works
with the $3 \times 3$ blocks as elements. Because of the collective treatment, we call the resulting ILU method
CILU. In a typical row, for example row $k$, $\widetilde{K\bar{K}}$ has non-zero elements ($3 \times 3$ matrices) at positions
$(k, k \pm 1, k \pm I, k \pm I \pm 1, k - 2, k + I - 2, k - 2I, k - 2I + 1)$. We choose the following
non-zero pattern $G = (k, k \pm 1, k \pm I, k \pm I \pm 1, k \pm I \mp 1)$ for the approximate factorization.
THE LINEAR MULTIGRID ALGORITHM


The linear multigrid algorithm solves the linearized equation system (7) at each time step. The 
F-cycle will be used. The number of pre-smoothings and the number of post-smoothings are both 1. 

The coarsest grid will be as coarse as possible; the coarsest grid is 2 x 2 in our cases. A direct solver is 
applied on the coarsest grid. The transfer operators for prolongation may be different in the computation 
of coarse grid matrices by means of Galerkin coarse grid approximation and in the computation of 
coarse grid correction. For the computation of coarse grid matrices, the prolongation operators for the 
velocities in the momentum equations are bilinear interpolation, but the hybrid interpolation [28] is used 
in the continuity equation in order to preserve the structure of the matrix on every grid. The 
prolongation operator for the pressure is a piecewise constant interpolation. For completeness, we 
describe the hybrid interpolation here. Cell-centered coarsening is used, taking unions of four fine grid
cells to form a coarse grid cell, as illustrated in figure 1.

Figure 1. A cell of $\bar{\mathcal{G}}$ and the corresponding four cells of $\mathcal{G}$.

The correspondence between the numbering of the variables $\bar{V}^1 \in \bar{U}^1: \bar{\mathcal{G}} \to \mathbb{R}$ on the coarse grid
$\bar{\mathcal{G}}$ and of $V^1 \in U^1: \mathcal{G} \to \mathbb{R}$ on the fine grid $\mathcal{G}$ is also presented in this figure; coarse grid
quantities are indicated by an overbar. The hybrid interpolation $P^1: \bar{U}^1 \to U^1$ is constructed by using
linear interpolation in the $\xi^1$-direction but zeroth order interpolation in the $\xi^2$-direction:


we 2 we 
we 2 we 


where $P^{1*}$ is the adjoint of $P^1$ (cf. [21] for this way of specifying a prolongation). Here $w = 0$ when
the "west" point refers to a point outside the domain and $w = 1$ elsewhere, and similarly for $e$ relative to
"east" points. The underlined element indicates that the corresponding point has index $2i$ on the fine
grid, if the operator is applied to point $i$ on the coarse grid. The hybrid interpolation $P^2$ for $V^2$ is
constructed similarly. Coarse grid correction is computed by using bilinear interpolation for the
velocities and piecewise constant interpolation for the pressure. The restriction operators use the adjoint
of the hybrid interpolation for the momentum equations and that of the piecewise constant interpolation
for the continuity equation. More details about the choice of transfer operators are given in [28] and
[31], and an efficient computation of Galerkin coarse grid approximation is presented in [29] and [31]
for systems of equations.
Reduction factors are used as a measure of the performance of multigrid. The average and the
asymptotic reduction factors will be presented. Let $r = \|\mathbf{r}\|$, with $\mathbf{r} = f - Kx$ the residual of equation (7)
and the norm $\|\mathbf{r}\|$ defined by

$$\|\mathbf{r}\| = \left( \sum_j (f^m - K^m x^m)_j^2 \Big/ (M N^m) \right)^{1/2}, \qquad (33)$$

with $M$ the number of partial differential equations and $N^m$ the number of grid points in $\mathcal{G}^m$. At each
time step we have a linearized equation system which is solved by a number of linear multigrid
iterations. Let $r_0$ and $r_n$ denote respectively the residual norm before and after $n$ cycles of multigrid
iterations on the finest grid. The average reduction factor $\rho$ is defined by

$$\rho = (r_n / r_0)^{1/n}. \qquad (34)$$

The reduction factor at the $i$-th iteration is defined by

$$\rho_i = r_i / r_{i-1}. \qquad (35)$$

If a limit of $\rho_i$ exists, then it is the asymptotic reduction factor. Define $r^s = \|\mathbf{r}^s\|$, with $\mathbf{r}^s = f_s - K_s x$
the residual of equation (10). A steady state is approximately obtained if

$$r^s_t / r^s_0 < \varepsilon \ll 1 \qquad (36)$$

is satisfied, where $r^s_0$ is $r^s$ at time 0 and $r^s_t$ is $r^s$ at time $t$. The values of $\varepsilon$ are reported in figures 3-8.
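
The definitions (34) and (35) translate directly into a few lines of code. The following Python sketch assumes that the residual norms $r_0, r_1, \ldots, r_n$ have been recorded after each multigrid cycle; the function name and the input list are hypothetical and chosen only for illustration.

```python
def reduction_factors(residual_norms):
    """Return (average factor, list of per-iteration factors).

    residual_norms is assumed to be [r_0, r_1, ..., r_n] with n >= 1
    and all entries positive.
    """
    r = list(residual_norms)
    n = len(r) - 1
    if n < 1:
        raise ValueError("need at least two residual norms")
    # Average reduction factor, cf. (34): (r_n / r_0)^(1/n)
    average = (r[-1] / r[0]) ** (1.0 / n)
    # Reduction factor of each iteration, cf. (35): r_i / r_(i-1)
    per_iteration = [r[i] / r[i - 1] for i in range(1, n + 1)]
    return average, per_iteration


# Made-up residual history, not data from the paper
avg, factors = reduction_factors([1.0, 0.5, 0.125])
```

The average factor is the geometric mean of the per-iteration factors, so fast initial convergence can make it noticeably smaller than the factors observed in the last few iterations.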

From the results of the following experiments, we choose the most robust method and undertake a 
further test, which aims at finding a proper choice of prolongation operators for the formulation of 
coarse grid operators. So the prolongation operators for the velocity in the momentum equations now 
use the hybrid interpolation for the velocities in the continuity equation. This specification of 
prolongation operators violates the well-known accuracy condition ([6]) for transfer operators. In [31], it 
is found that with such specification the multigrid method still works fine. The conclusion is that 
bilinear prolongation is better for low Reynolds number cases, whereas hybrid interpolation is better for 
high Reynolds number cases. With application to various test problems, which are described later, we 
perform some further experiments and try to select the best choice. 


NUMERICAL EXPERIMENTS AND RESULTS 


Three test problems are chosen, which are the square driven cavity problem, the skewed driven 
cavity problem and the L-shaped driven cavity problem, as illustrated in figure 2. These impose various 
difficulties. For brevity, we refer to the square driven cavity problem as problem 1 , the skewed driven 
cavity problem as problem 2, and the L-shaped driven cavity as problem 3. In problem 1, the grid is 
uniform Cartesian. This gives the simplest discretization, because stretched mesh cells and mixed 
derivatives do not occur. In problem 2, the grid is still uniform but the grid lines are not orthogonal, so 
mixed derivatives occur. Giving rise to more difficulties, problem 3 has a stretched non-uniform 
non-orthogonal grid. For each test problem, two Reynolds numbers are considered, Re = 1 and
Re = 1000. The two cases represent viscosity-dominated flows and (mostly) convection-dominated
flows. Benchmark solutions for Re = 1000 are provided in [3], [5], [11], respectively, for problems 1-3.
All computations are carried out on an HP-730 computer.

Figure 2: The three test problems and corresponding grids: a. the square driven cavity problem;
b. the skewed driven cavity problem; c. the L-shaped driven cavity problem; d. the computational
domain of the L-shaped driven cavity problem

Prior to the measurement of reduction factors a linear system should be specified. It is natural to use 
equation (7) at steady state (more precisely, almost steady state). For Re = 1, 20 time steps with
$\Delta t = 1$ are carried out to give the matrix and the right-hand side at the 'steady state', with each time
step accompanied by one multigrid iteration. Only one multigrid iteration is used because we do not
want to compute the real time history and so it is not necessary to solve the linear system at each time
step very accurately. For Re = 1000, the number of time steps is changed to 100 with $\Delta t = 0.2$. The
smoother used in the computations for the 'steady states' is CILU, with the underrelaxation parameter $\omega$
fixed at 0.7. A smaller time step is needed for larger Reynolds numbers to increase the main diagonal,
because the discretization uses central differencing, which results in bad smoothing for Re and $\Delta t$ being
too large. Figures 3-8 present the streamlines of the test problems. They match well with the
corresponding results in [3], [5], [11]. 

In order to determine the best performance of each smoother, the underrelaxation parameter is
sampled at an interval of 0.1 to find a good value. Tables 1-3 give the reduction factors for the multigrid
methods using different smoothers on the 128 x 128 grids corresponding to the best values of the
underrelaxation factor $\omega$. If machine accuracy is not reached, the reduction factors for the last 5
iterations are given, otherwise the reduction factors for the last 5 successive iterations before machine
accuracy is reached. The maximum number of grid levels $l_f$ is also given; exceeding this causes the
algorithm to fail.

From these tables, we deduce the following observations. 


• The SCGS smoother works for all test cases. However, it is clearly very problem-sensitive. The 
underrelaxation parameter changes significantly as the problems and the Reynolds number change. 
For problem 3 and Re = 1000 it is very slow. For problem 2, overrelaxation has to be employed 
instead of underrelaxation. The number of grid levels must be reduced in problem 3, i.e., the 
coarse grids cannot be very coarse in order to obtain smoothing. 

• CLGS seems to work better than the SCGS smoother, because the underrelaxation parameter 
does not vary so much and usually the reduction factors are smaller. The one exceptional case is 
problem 1, where the number of grids has to be reduced by 1, even for Re = 1. What is more
surprising in this case is that for Re = 1000 convergence cannot be achieved with forward
horizontal line smoothing. But using forward vertical line smoothing and strong damping, we
recover convergence, which is, however, worse than that of the SCGS smoother. To improve the
performance for this case, perhaps symmetrical line smoothing should be used. We therefore also
tested symmetrical horizontal line smoothing (SHCLGS), symmetrical vertical line smoothing
(SVCLGS) and symmetrical alternating line smoothing (SACLGS). It is found that SHCLGS and
SACLGS both do not show any improvement, because the horizontal sweeps destroy smoothing
seriously; SVCLGS gives some improvement, giving the best average reduction factor $\rho = 0.5610$ for
$\omega = 0.2$.

• With the SILU smoother, the underrelaxation parameter is less sensitive to change of problem and 
Reynolds number than with SCGS and CLGS, but the reduction factors are usually larger than 
those of SCGS and CLGS. The number of grid levels cannot exceed 4 or 5, otherwise the method 
does not work due to loss of smoothing on coarser grids. The well-known dependence of ILU 
smoothers on grid point ordering plays a role in problem 3. SILU is here found to be a bad
smoother with lexicographic grid point ordering. The results presented have been obtained with a
backward ordering, starting from corner D (cf. figure 2d) and moving first down and then to the
left.

• The CILU smoother is not problem-sensitive. Very good convergence is obtained for all test
problems. It is possible to fix the underrelaxation parameter at one value, which here is found to 
be 0.8. The dependence on the grid point ordering is pronounced for problem 3, for which the 
backward ordering described for SILU was used. 


According to the above observations, we can arrange the four smoothers in the following order from the 
best to worst: CILU, CLGS, SCGS, SILU. Of course, this conclusion is not general, because 
discretization and transfer operators both certainly affect the overall performance of an algorithm. 

Apart from robustness, efficiency should also be taken into account. Table 4 gives the CPU time in
seconds per cycle ($t_c$) for the smoothers. The most robust smoother CILU takes roughly twice as much
time per cycle as the other three smoothers. The efficiency of two multigrid methods using two smoothers
(referred to as method 1 and method 2) may be compared as follows. Let the average reduction factor
of method 1 be $\rho_1$ and that of method 2 be $\rho_2$, and let the CPU time per multigrid cycle be $t_{c1}$ and $t_{c2}$,
respectively. For a required accuracy, for example a reduction $\varepsilon$ of the initial residual norm, method 1
takes $t_{c1} \ln\varepsilon / \ln\rho_1$ CPU time and method 2 takes $t_{c2} \ln\varepsilon / \ln\rho_2$ CPU time. Define the efficiency factor
$E_f$ of method 1 with respect to method 2 by

$$E_f = \frac{t_{c2} \ln \rho_1}{t_{c1} \ln \rho_2}. \qquad (37)$$

So if $E_f > 1$, then method 1 is more efficient; if $E_f < 1$ then method 2 is more efficient. For
comparisons among more than 2 methods, one of them is used as a standard, in place of method 2.
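
As an illustration of (37), the following Python sketch evaluates the efficiency factor for one pair of smoothers. The function name is hypothetical; the sample values are the SCGS and CILU entries for problem 1 at Re = 1 as read from tables 1 and 4, with CILU taken as method 2.

```python
import math


def efficiency_factor(t_c1, rho1, t_c2, rho2):
    """Efficiency factor of method 1 relative to method 2, cf. (37).

    t_c1, t_c2: CPU time per multigrid cycle of each method.
    rho1, rho2: average reduction factors, each assumed to lie in (0, 1).
    """
    if not (0.0 < rho1 < 1.0 and 0.0 < rho2 < 1.0):
        raise ValueError("reduction factors must lie strictly between 0 and 1")
    return (t_c2 * math.log(rho1)) / (t_c1 * math.log(rho2))


# SCGS (method 1) against CILU (method 2), problem 1, Re = 1
e_f = efficiency_factor(t_c1=25.0, rho1=0.2561, t_c2=56.3, rho2=0.1732)
# e_f is roughly 1.7, in line with the corresponding entry of table 5
```

A value above 1 means that the cheaper cycle of method 1 more than compensates for its weaker reduction per cycle.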

Using $\rho$ and $t_c$ given in tables 1-4 and taking CILU as the standard for the comparison, table 5 presents
$E_f$ in all the test cases. Bigger numbers mean higher efficiency. Apparently, the SCGS smoother and
the CLGS smoother are mostly more efficient than the CILU smoother, though not by much; the SILU
smoother is mostly less efficient. Because SCGS and CLGS can be easily altered to parallelizable
versions by using black-white or zebra ordering, one may argue that SCGS and CLGS have more
parallelization potential than CILU, and higher efficiency can be obtained. But this may be true only in
two dimensions.

Now with CILU, we investigate convergence of the multigrid method using the hybrid interpolation 
instead of bilinear interpolation for the velocities in the momentum equations in the formulation of 
coarse grid operators. The results are given in table 6 in terms of the reduction factors for the best
values of $\omega$. Clearly, the method works much better for Re = 1000 than for Re = 1. Using the hybrid
prolongation for Re = 1000 the method performs as well as the method using the bilinear
prolongation. It is easy to see that for low Reynolds number cases bilinear prolongation is better, but
this is not so clear for high Reynolds number cases. We found that for high Reynolds numbers there are
some cases in which bilinear prolongation does not work but the hybrid prolongation still works well.
Therefore it is safer to use the hybrid prolongation for high Reynolds numbers. One may conclude again 
that the hybrid prolongation is more suitable for high Reynolds numbers and bilinear prolongation is 
more suitable for low Reynolds numbers. 


CONCLUSIONS 


The performance of the multigrid method using SCGS, CLGS, SILU and CILU smoothers is
studied for the incompressible Navier-Stokes equations in general coordinates. Galerkin coarse grid
approximation is used in the computation of coarse grid matrices. Both robustness and efficiency of the
methods are investigated and measured in terms of reduction factors and efficiency factors. The results
show that the most robust smoother is CILU; CLGS and SCGS follow. SILU is the worst. For
efficiency, the order from the best to the worst is CLGS, SCGS, CILU and SILU. Although CILU is
somewhat less efficient than CLGS and SCGS and it has less parallelization potential in two
dimensions, it may be more promising in three dimensions because it is much more robust than all the
others and parallelization can also be established among planes.

For prolongation operators in the computation of coarse grid operators, the hybrid interpolation is a
more appropriate choice for high Reynolds numbers, whereas bilinear interpolation is a more appropriate
choice for low Reynolds numbers.
Figure 3: Streamlines for problem 1, Re = 1, $r^s_t / r^s_0 < 4.581 \times 10^{-11}$

Figure 4: Streamlines for problem 1, Re = 1000, $r^s_t / r^s_0 < 1.804 \times 10^{-3}$

Figure 5: Streamlines for problem 2, Re = 1, $r^s_t / r^s_0 < 4.358 \times 10^{-10}$

Figure 6: Streamlines for problem 2, Re = 1000, $r^s_t / r^s_0 < 4.484 \times 10^{-6}$





Table 1: Reduction Factors Corresponding to the Best Values of $\omega$ for Problem 1 on the 128 x 128 Grid

                  Re = 1, $r_0$ = 12.96                  Re = 1000, $r_0$ = 1.605 x 10^{-02}
Smoother          SCGS     CLGS     SILU     CILU       SCGS     CLGS     SILU     CILU
$\omega$          0.8      1.0      0.9      1.0        0.3      0.1*     0.7      1.0
$l_f$             7        6        4        7          7        6        4        7
$i$               16       16       16       15         16       16       16       16
$\rho_i$          0.2787   0.2183   0.3708   0.2026     0.6464   0.8344   0.8168   0.4122
$\rho_{i+1}$      0.2811   0.2184   0.3761   0.2079     0.6420   0.8950   0.8009   0.4116
$\rho_{i+2}$      0.2789   0.2237   0.3807   0.2142     0.5994   0.8735   0.8244   0.4136
$\rho_{i+3}$      0.2816   0.2235   0.3849   0.2224     0.6088   0.9048   0.8846   0.4155
$\rho_{i+4}$      0.2791   0.2300   0.3880   0.2393     0.5869   0.8899   0.9346   0.4131
$\rho$            0.2561   0.1973   0.2863   0.1732     0.4918   0.7773   0.7005   0.2996

* Forward vertical smoothing


Table 2: Reduction Factors Corresponding to the Best Values of $\omega$ for Problem 2 on the 128 x 128 Grid

                  Re = 1, $r_0$ = 25.92                  Re = 1000, $r_0$ = 2.697 x 10^{-02}
Smoother          SCGS     CLGS     SILU     CILU       SCGS     CLGS     SILU     CILU
$\omega$          1.2      0.9      0.9      0.8        1.2      1.0      0.8      0.9
$l_f$             7        7        4        7          7        7        4        7
$i$               16       16       16       16         16       16       16       16
$\rho_i$          0.3377   0.3476   0.7401   0.3857     0.3693   0.4166   0.7784   0.3052
$\rho_{i+1}$      0.3406   0.3492   0.7418   0.3885     0.3650   0.4167   0.7798   0.3190
$\rho_{i+2}$      0.3452   0.3512   0.7437   0.3911     0.3613   0.4173   0.7811   0.3312
$\rho_{i+3}$      0.3432   0.3536   0.7456   0.3931     0.3582   0.4180   0.7825   0.3399
$\rho_{i+4}$      0.3463   0.3563   0.7476   0.3942     0.3558   0.4184   0.7839   0.3447
$\rho$            0.3032   0.3205   0.6315   0.2968     0.3472   0.3941   0.6716   0.2802
Table 3: Reduction Factors Corresponding to the Best Values of $\omega$ for Problem 3 on the 128 x 128 Grid

                  Re = 1, $r_0$ = 18.20                  Re = 1000, $r_0$ = 1.969 x 10^{-02}
Smoother          SCGS     CLGS     SILU     CILU       SCGS     CLGS     SILU     CILU
$\omega$          1.0      0.9      0.8*     0.8*       0.1      0.4      0.2*     0.8*
$l_f$             6        7        5        7          5        7        5        7
$i$               16       15       16       16         16       16       16       16
$\rho_i$          0.7302   0.2320   0.5960   0.6997     0.9381   0.6527   0.9293   0.3496
$\rho_{i+1}$      0.7319   0.1699   0.5878   0.4104     0.9399   0.6354   0.9337   0.3355
$\rho_{i+2}$      0.7334   0.2131   0.5914   0.2317     0.9400   0.6425   0.9376   0.3344
$\rho_{i+3}$      0.7347   0.1941   0.5927   0.6643     0.9383   0.6536   0.9411   0.3448
$\rho_{i+4}$      0.7359   0.2614   0.5909   0.4450     0.9352   0.6386   0.9442   0.3292
$\rho$            0.5914   0.1645   0.4992   0.3673     0.7815   0.5422   0.8183   0.2795

* Backward lexicographical ordering of grid points


Table 4: CPU Time Needed by One Multigrid Cycle on 128 x 128 Grid

Smoother          SCGS     CLGS     SILU     CILU
$t_c$ (s)         25.0     23.4     28.9     56.3


Table 5. The Efficiency Factor $E_f$ for All Test Cases

                  Re = 1                                 Re = 1000
Smoother          SCGS     CLGS     SILU     CILU       SCGS     CLGS     SILU     CILU
Problem 1         1.7      2.2      1.4      1.0        1.3      0.5      0.6      1.0
Problem 2         2.2      2.3      0.7      1.0        1.9      1.8      0.6      1.0
Problem 3         1.2      4.3      1.4      1.0        0.4      1.2      0.3      1.0
Table 6: Reduction Factors of the Multigrid Method Using CILU and the Hybrid
Prolongation for Various Problems on the 128 x 128 Grids 




0.3837 0.3278 


0.2697 x 10^{-01}

0.1969 x 10^{-01}

0.8231 x 10^{-12}

0.3485 x 10^{-12}

0.2980 

0.2900 


Backward lexicographical ordering of grid points 